GB2564057A

GB2564057A - Method for analyzing source and destination of internet traffic

Info

Publication number: GB2564057A
Application number: GB1816212.3A
Authority: GB
Inventors: Zhang Dashun
Original assignee: Shanghai Yamu Communication Tech Co Ltd
Current assignee: Shanghai Yamu Communication Tech Co Ltd
Priority date: 2016-04-14
Filing date: 2016-08-17
Publication date: 2019-01-02
Also published as: WO2017177591A1; CN105704260A; JP2019514303A; CN105704260B; RU2702048C1; JP7075348B2

Abstract

The present invention provides a method for analyzing the source and destination of Internet traffic. In the method, the source and destination of Internet traffic are obtained by processing DNS logs. The method comprises the following steps: a log filtering step of filtering DNS logs that cannot reflect a real access path of a user; a log segmentation step of sequentially segmenting, according to a source IP, a timestamp difference, and a central domain, the DNS logs obtained after the log filtering step to obtain segmented access paths; and a data summarization step of summarizing all the segmented access paths. By means of the analysis method of the present invention, the source and destination of Internet traffic can be mastered, so that analysis and optimization of website traffic can be better facilitated. Furthermore, by completely knowing about the flow direction state of the entire Internet traffic, the traffic state of other websites can be analyzed and understood from the whole perspective.

Description

METHOD FOR ANALYZING SOURCE AND DESTINATION OF INTERNET TRAFFIC

Technical Field

The disclosure relates to the field of Internet DNS name resolution, and in particular to a method for analyzing a source and a destination of Internet traffic.

Background Art

The so-called source and destination of Internet traffic refers to a series of website access paths including a certain website a user first accesses and other websites the user later accesses. There is only one mainstream approach to confirm the website traffic source, that is, to add a JavaScript monitoring code to a website page. Third-party detection tools such as Google Analytics and Baidu Analytics are the most common.

The above-described statistical methods have significant limitations: each website can learn only the website the visitor accessed immediately before it, and cannot learn about the other websites the visitor accessed earlier or where the visitor goes after leaving.

DNS (Domain Name System) is a distributed database that provides a mapping between domain names and IP addresses on the Internet. DNS allows the user to access the Internet more conveniently, without memorizing the numeric IP strings that machines read directly. The “DNS name resolution technique” works as follows: when a user needs to access a website, the user types its domain name in the browser; after the user presses the enter key, the browser first initiates a DNS request; through DNS, the browser obtains the IP address of the server corresponding to the domain name, and then initiates an HTTP request to that IP address. DNS logs record the response content of each DNS request, and therefore record the domain name information of almost all user requests. However, the logs may contain a large amount of abnormal and invalid information. For example, servers may also initiate DNS requests and thereby generate a large amount of domain name information, and Internet crawlers and even network attacks may generate a large number of DNS requests. Such requests do not reflect the real access paths of a user.

Currently, no good method exists on the market for analyzing the entire access paths of Internet visitors. The disclosure fills this gap by providing a method that reprocesses DNS logs to analyze website traffic, revealing which websites the traffic comes from and which websites it goes to after leaving.

Summary of the Disclosure

In view of the above-described defects, the disclosure provides a method for analyzing a source and a destination of Internet traffic. By means of the method of the disclosure, non-human access behavior is removed from the logs as far as possible, so that the source and the destination of the Internet traffic can be effectively obtained.

The method for analyzing a source and a destination of Internet traffic in the disclosure, which obtains the source and the destination of the Internet traffic by processing a DNS log, includes the following steps: a log filtering step of filtering DNS logs that cannot reflect a real access path of a user; a log segmentation step of sequentially segmenting, based on a source IP, a time stamp difference and a central domain, the DNS logs obtained after the log filtering step to obtain segmented access paths; and a data summarization step of summarizing all the segmented access paths.

Preferably, by setting a black list and a white list in the log filtering step, DNS logs containing domain name requests of significant interest are retained, and DNS logs containing non-human domain name requests generated by a server are removed.

Preferably, the removal of the DNS logs further includes removing logs accessed by an enterprise IP and logs where the IP is not resolved.

Preferably, the DNS log segmentation based on the source IP is to obtain continuous DNS logs with the same source IP over a period of time.

Preferably, the segmentation of logs based on the time stamp difference is to segment, based on the time stamp difference in DNS logs, the logs after being segmented based on the source IP, and if the time stamp difference in two DNS logs is longer than a specified time length, the two DNS logs are split up.

Preferably, the specified time length is three seconds.

Preferably, the analysis method further includes, after the step of segmenting the DNS logs based on the time stamp difference, a merging step of converting the domain names in the access paths obtained by the segmentation into domains, and merging continuous identical domains, so as to obtain a path of the source IP.

Preferably, the segmentation based on the central domain is to segment the path of the source IP based on the central domain, and the access path obtained after the segmentation is: source domain name n+...+source domain name 1+central domain name+destination domain name 1+...+destination domain name n, and the central domain is a domain to be mainly analyzed based on user/system requirements.

Preferably, all the access paths of the source IP, which are obtained after the segmentation step based on the central domain, are summarized in the data summarization step.

By means of the analysis method of the disclosure, the source and the destination of the Internet traffic can be determined, so that analysis and optimization of website traffic can be better facilitated. Furthermore, with complete knowledge of the flow direction of the entire Internet traffic, the traffic state of other websites can be analyzed and understood from an overall perspective.

Brief Description of Drawings

FIGs. 1(a) and 1(b) are flow diagrams of a method for analyzing a source and a destination of Internet traffic in the disclosure; and FIGs. 2(a) and 2(b) are schematic diagrams of a traffic source obtained by the method for analyzing a source and a destination of Internet traffic in the disclosure.

Description of Embodiments

The disclosure will be described in detail below with reference to the accompanying drawings and embodiments. The following embodiments are not intended to limit the invention. Variations and advantages that can be conceived by those skilled in the art are included in the disclosure without departing from the spirit and scope of the disclosure.

As mentioned above, DNS (Domain Name System) is a distributed database that provides a mapping between domain names and IP addresses on the Internet. DNS allows a user to access the Internet more conveniently, without memorizing the numeric IP strings that machines read directly. When accessing a website, the user first types its domain name in the browser and presses the enter key. The browser then initiates a DNS request. Through DNS, the browser obtains the IP address of the server corresponding to the domain name, and then initiates an HTTP request to that IP address. The above-described steps constitute the DNS name resolution technique.

DNS logs are generated during the above-described domain name resolution process. DNS logs record the response content of each DNS request, and record the domain name information of almost all user requests. The format of a DNS log is as follows:

Source IP | Domain Name | Time stamp | Resolution IP | Status Code

14.***.***.10|www.baidu.com|20141211035932|180.***.***.107;180.***.***.108|0

That is, a DNS log consists of “Source IP”, “Domain name”, “Time stamp”, “Resolution IP” and “Status code”.
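The five-field log format described above can be read with a few lines of Python. This is only an illustrative sketch, not part of the disclosed method: the names `DnsRecord` and `parse_dns_log_line` are hypothetical, and it assumes that fields are separated by `|` and that multiple resolution IPs are separated by `;`.

```python
from datetime import datetime
from typing import List, NamedTuple


class DnsRecord(NamedTuple):
    """One DNS log entry (hypothetical container for the five fields)."""
    source_ip: str
    domain_name: str
    timestamp: datetime
    resolution_ips: List[str]
    status_code: str


def parse_dns_log_line(line: str) -> DnsRecord:
    """Split one pipe-delimited log line into its five fields."""
    source_ip, domain_name, stamp, resolved, status = line.strip().split("|")
    return DnsRecord(
        source_ip=source_ip,
        domain_name=domain_name,
        # The example time stamps use the layout YYYYMMDDhhmmss.
        timestamp=datetime.strptime(stamp, "%Y%m%d%H%M%S"),
        # Assumption: several resolution IPs are joined with ';'.
        resolution_ips=[ip for ip in resolved.split(";") if ip],
        status_code=status,
    )


record = parse_dns_log_line(
    "14.***.***.10|www.baidu.com|20141211035932|180.***.***.107;180.***.***.108|0"
)
print(record.domain_name, record.timestamp.isoformat())
```

The masked octets (`***`) are kept as plain text, so the sketch works on the anonymized sample line as well as on real addresses.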

Since the DNS log includes the domain name information of all the user requests, the present inventors contemplate analyzing the source and the destination of website traffic by reprocessing the DNS log. However, the log also includes a large amount of abnormal and invalid information. For example, servers may also initiate DNS requests and thereby generate a large amount of domain name information, and Internet crawlers and even network attacks may generate a large number of DNS requests. Such requests do not reflect the real access path of a user. Based on the above situation, the present inventors contemplate effectively obtaining the source and destination of Internet traffic by removing non-human access behavior from the log as far as possible. FIG. 1 is a flow diagram of the method for analyzing a source and a destination of Internet traffic in the disclosure. As shown in FIG. 1, the method for analyzing a source and a destination of Internet traffic in the disclosure includes the following steps.

First, DNS logs that cannot reflect the real access path of a user are filtered out (step S1). As described above, since DNS requests include many domain names that do not reflect the real access path of a user, cleaning is required. For example, by setting a black list and a white list, DNS logs containing domain name requests of significant concern are retained, and DNS logs containing non-human domain name requests generated by a server are removed. The black list removes the non-human domain name requests generated by servers, and the white list retains domain names of significant concern. The white list has a higher priority than the black list. Additionally, the removal of DNS logs further includes removing logs accessed by an enterprise IP and logs where the IP is not resolved. Logs from an enterprise IP are removed because such an IP may generate logs from multiple persons accessing simultaneously, which interferes with the judgment of an individual access track; a log with an unresolved IP is removed because it represents a failed access. Log filtering is thus performed along several dimensions, so that DNS logs reflecting the real access path of a user are obtained.
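The filtering rules of step S1 can be sketched as follows. The function `filter_logs`, the sample domains, the addresses and the failure status code are all hypothetical. Two points are assumptions rather than statements of the disclosure: an unresolved log is recognised by an empty resolution field, and the enterprise-IP and unresolved checks are applied before the lists are consulted.

```python
from typing import Iterable, Iterator, Set, Tuple

# (source_ip, domain_name, time_stamp, resolution_ip, status_code)
Record = Tuple[str, str, str, str, str]


def filter_logs(
    records: Iterable[Record],
    blacklist: Set[str],
    whitelist: Set[str],
    enterprise_ips: Set[str],
) -> Iterator[Record]:
    """Yield only the records that may reflect a real user access."""
    for record in records:
        source_ip, domain, _stamp, resolution_ip, _status = record
        if source_ip in enterprise_ips:
            continue  # shared address: many persons behind one IP
        if not resolution_ip:
            continue  # assumption: empty field means the IP was not resolved
        # The white list takes priority over the black list.
        if domain in whitelist or domain not in blacklist:
            yield record


records = [
    ("198.51.100.1", "www.example.com", "20141211035932", "192.0.2.1", "0"),
    ("198.51.100.1", "ntp.example.net", "20141211035933", "192.0.2.2", "0"),
    ("198.51.100.1", "cdn.example.org", "20141211035934", "192.0.2.3", "0"),
    ("203.0.113.7", "www.example.com", "20141211035935", "192.0.2.1", "0"),
    ("198.51.100.1", "missing.example.com", "20141211035936", "", "2"),
]
kept = list(
    filter_logs(
        records,
        blacklist={"ntp.example.net", "cdn.example.org"},
        whitelist={"cdn.example.org"},
        enterprise_ips={"203.0.113.7"},
    )
)
print([record[1] for record in kept])
```

In this sample, `cdn.example.org` appears on both lists and is retained, which illustrates the stated priority of the white list.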

Then, the DNS logs obtained after the log filtering step are segmented sequentially based on the source IP, the time stamp difference, and the central domain, so that segmented access paths are obtained (step S2).

The detailed steps are as follows.

1) Segmentation based on the source IP (step S21). The DNS log is segmented based on the source IP so as to obtain continuous DNS logs with the same source IP over a period of time.

For example, source IP 1.1.1.1 is different from source IP 2.2.2.2, so the log is segmented between them, as shown below:

Source IP | Domain Name | Time stamp | Resolution IP | Status Code

1.1.1.1|www.baidu.com|20141211035932|180.***.***.107;180.***.***.108|0
1.1.1.1|www.qq.com|20141211035932|180.***.***.107;180.***.***.108|0
---------------------------------------Log cutting line-----------------------------------------
2.2.2.2|www.baidu.com|20141211035932|180.***.***.107;180.***.***.108|0
2.2.2.2|www.qq.com|20141211035932|180.***.***.107;180.***.***.108|0

2) Then, the logs segmented based on the source IP are further segmented based on the time stamp difference (step S22). That is, if the time stamp difference between two DNS logs is longer than a specified time length, the two DNS logs are split up, because the interval between them is so long that they are considered two different behaviors. The specified time length can be adjusted as needed. In the embodiment, the specified time length is three seconds, i.e., the log is segmented wherever the interval between time stamps is longer than three seconds.

For example, as shown below, the DNS log of source IP 2.2.2.2 is further segmented based on its time stamp differences (the time stamp 20141211035932 indicates 03:59:32 on December 11, 2014).

Source IP | Domain Name | Time stamp | Resolution IP | Status Code

2.2.2.2|www.baidu.com|20141211000001|180.***.***.107;180.***.***.108|0
2.2.2.2|a.qq.com|20141211000002|180.***.***.107;180.***.***.108|0
2.2.2.2|b.baidu.com|20141211000003|180.***.***.107;180.***.***.108|0
2.2.2.2|c.tanx.com|20141211000004|180.***.***.107;180.***.***.108|0
2.2.2.2|c.allyes.com|20141211000005|180.***.***.107;180.***.***.108|0
---------------------------------------Log cutting line-----------------------------------------
2.2.2.2|www.sina.com|20141211000009|180.***.***.107;180.***.***.108|0
---------------------------------------Log cutting line-----------------------------------------
2.2.2.2|www.qq.com|20141211000015|180.***.***.107;180.***.***.108|0
---------------------------------------Log cutting line-----------------------------------------
2.2.2.2|www.qq.com|20141211000019|180.***.***.107;180.***.***.108|0
---------------------------------------Log cutting line-----------------------------------------
2.2.2.2|www.a.com|20141211000024|180.***.***.107;180.***.***.108|0
---------------------------------------Log cutting line-----------------------------------------
2.2.2.2|www.b.com|20141211000029|180.***.***.107;180.***.***.108|0

As shown above, since the difference between the time stamps 20141211000005 (05 seconds) and 20141211000009 (09 seconds) is four seconds, which is longer than three seconds, the log is split between them. The difference between 20141211000009 and 20141211000015 is six seconds, so the log is split there as well.
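Steps S21 and S22 can be sketched together in Python. The helper names `group_by_source_ip` and `split_by_gap` are hypothetical, the entries are reduced to three fields for brevity, and the time stamps are written in the 14-digit YYYYMMDDhhmmss layout. The sketch compares each entry with the one immediately before it, which is one reading of "the time stamp difference in two DNS logs".

```python
from collections import defaultdict
from datetime import datetime, timedelta
from typing import Dict, List, Tuple

Entry = Tuple[str, str, str]  # (source_ip, domain_name, time_stamp)
STAMP_FORMAT = "%Y%m%d%H%M%S"


def group_by_source_ip(entries: List[Entry]) -> Dict[str, List[Entry]]:
    """Step S21: collect the entries of each source IP in time order."""
    groups: Dict[str, List[Entry]] = defaultdict(list)
    for entry in entries:
        groups[entry[0]].append(entry)
    for ip in groups:
        groups[ip].sort(key=lambda e: e[2])  # fixed-width stamps sort as text
    return dict(groups)


def split_by_gap(entries: List[Entry], max_gap_seconds: int = 3) -> List[List[str]]:
    """Step S22: open a new segment when the gap exceeds the threshold."""
    max_gap = timedelta(seconds=max_gap_seconds)
    segments: List[List[str]] = []
    previous = None
    for _ip, name, stamp in entries:
        current = datetime.strptime(stamp, STAMP_FORMAT)
        if previous is None or current - previous > max_gap:
            segments.append([])
        segments[-1].append(name)
        previous = current
    return segments


log = [
    ("2.2.2.2", "www.baidu.com", "20141211000001"),
    ("2.2.2.2", "a.qq.com", "20141211000002"),
    ("2.2.2.2", "b.baidu.com", "20141211000003"),
    ("2.2.2.2", "c.tanx.com", "20141211000004"),
    ("2.2.2.2", "c.allyes.com", "20141211000005"),
    ("2.2.2.2", "www.sina.com", "20141211000009"),
    ("2.2.2.2", "www.qq.com", "20141211000015"),
    ("2.2.2.2", "www.qq.com", "20141211000019"),
    ("2.2.2.2", "www.a.com", "20141211000024"),
    ("2.2.2.2", "www.b.com", "20141211000029"),
]
segments = split_by_gap(group_by_source_ip(log)["2.2.2.2"])
for segment in segments:
    print(segment)
```

With the three-second threshold of the embodiment, the ten entries fall into six segments: the first holds five domain names and each of the others holds one.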

As described above, the log is segmented into six parts. In the first part, source IP 2.2.2.2 accessed five domain names: www.baidu.com, a.qq.com, b.baidu.com, c.tanx.com, and c.allyes.com. According to a method for judging the access behavior of a user, it can be concluded that the user actually accessed only www.baidu.com; the remaining four domain names are requests generated additionally after the user clicked www.baidu.com, and are not real access behaviors of the user. Therefore, the access path obtained from the first part of the log is www.baidu.com. The method for determining the access behavior of a user is as follows: when a user clicks a URL, domain names other than the domain name of that URL are also requested. All the domain name requests that follow the request for the URL's domain name can be obtained with a crawler, and the crawled domain name requests are matched against the domain names in the segment cut from the DNS log, so that the correspondence between the DNS log and the domain name actually accessed by the user is obtained. From this correspondence, it can be known that this part of the log reflects the user actually accessing www.baidu.com. The second part of the log contains only www.sina.com, so www.sina.com is the domain name path accessed by the user.
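The matching between crawled requests and a log segment might look like the sketch below. The disclosure does not spell out the matching rule, so this is one possible reading under stated assumptions: the first domain name of a segment is treated as the clicked one when every later name in the segment belongs to the set of requests a crawler observed for it. The function `clicked_domain` and the `crawled` table are hypothetical, and the table contents merely mirror the example above rather than real crawl results.

```python
from typing import Dict, List, Optional, Set


def clicked_domain(segment: List[str], crawled: Dict[str, Set[str]]) -> Optional[str]:
    """Return the domain name the user clicked, or None if the crawl
    data does not explain the follow-up requests in the segment."""
    first, followers = segment[0], set(segment[1:])
    # Follow-up requests must all be known side requests of the first name.
    if followers <= crawled.get(first, set()):
        return first
    return None


# Hypothetical crawl result: requests triggered by loading each page.
crawled = {
    "www.baidu.com": {"a.qq.com", "b.baidu.com", "c.tanx.com", "c.allyes.com"},
}

first_part = ["www.baidu.com", "a.qq.com", "b.baidu.com", "c.tanx.com", "c.allyes.com"]
print(clicked_domain(first_part, crawled))
print(clicked_domain(["www.sina.com"], crawled))
```

A segment with a single domain name has no follow-up requests, so it is accepted as it stands; a segment that the crawl data cannot explain returns `None`, leaving the decision to the caller.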

After the paths of the above logs are connected, the obtained paths are shown as follows: www.baidu.com > www.sina.com > www.qq.com > www.qq.com > www.a.com > www.b.com.

Then, the paths obtained by segmentation based on the time stamp difference are converted to domains, i.e., second-level domains here, and continuous identical domains are merged. The merged result is as follows: baidu.com > sina.com > qq.com > a.com > b.com.
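The merging step can be sketched as below. The helpers `to_domain` and `merge_path` are hypothetical. Taking the last two labels of a name is a simplifying assumption that suits the `.com` names of the example; names under multi-label suffixes (for instance `.com.cn`) would need a public-suffix lookup instead.

```python
from itertools import groupby
from typing import Iterable, List


def to_domain(domain_name: str) -> str:
    """Reduce a domain name to its last two labels (naive assumption)."""
    labels = domain_name.rstrip(".").lower().split(".")
    return ".".join(labels[-2:])


def merge_path(domain_names: Iterable[str]) -> List[str]:
    """Convert names to domains and collapse continuous identical domains."""
    return [domain for domain, _run in groupby(to_domain(n) for n in domain_names)]


path = merge_path(
    ["www.baidu.com", "www.sina.com", "www.qq.com", "www.qq.com", "www.a.com", "www.b.com"]
)
print(" > ".join(path))
```

Only adjacent repeats are collapsed, so a later return to an earlier domain stays visible in the path.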

The above-described path is one path among the access behaviors of the source IP, and the access paths of all source IPs can be calculated according to the same rule.

3) Next, the above-described results are segmented based on the central domain (step S23). The central domain is the domain to be mainly analyzed based on user/system requirements; the analysis reveals from which domains the user arrives at the central domain and to which domains the user goes afterwards. For example, with a.com in the log taken as the central domain, the path is as follows: baidu.com > sina.com > qq.com > a.com > b.com.
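Segmentation around the central domain can be sketched as follows. The function `split_around_central` and its `depth` parameter are hypothetical. The sketch assumes that each occurrence of the central domain in a path yields its own pair of source and destination lists, with the nearest domain first in each list.

```python
from typing import List, Tuple


def split_around_central(
    path: List[str], central: str, depth: int = 3
) -> List[Tuple[List[str], List[str]]]:
    """For each occurrence of the central domain, return
    (sources nearest first, destinations nearest first)."""
    results = []
    for index, domain in enumerate(path):
        if domain != central:
            continue
        # Source domain 1 is the domain visited immediately before.
        sources = path[max(0, index - depth):index][::-1]
        destinations = path[index + 1:index + 1 + depth]
        results.append((sources, destinations))
    return results


path = ["baidu.com", "sina.com", "qq.com", "a.com", "b.com"]
for sources, destinations in split_around_central(path, "a.com"):
    print("sources:", sources, "destinations:", destinations)
```

For the example path, source domain 1 is qq.com, source domain 2 is sina.com, source domain 3 is baidu.com, and the only destination domain is b.com.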

For example, four paths of the foregoing source IP are listed below. Only the three layers of source domains preceding the central domain are shown in each path; the paths following the central domain are processed with the same logic as the paths preceding it. The actual number of layers can be adjusted according to specific needs. The paths are also shown in FIG. 2(a):

Source Domain 3 > Source Domain 2 > Source Domain 1 > Central Domain

Path 1: baidu.com > sina.com > qq.com > a.com (central domain)

Path 2: sina.com > baidu.com > qq.com > a.com (central domain)

Path 3: youku.com > sina.com > baidu.com > a.com (central domain)

Path 4: baidu.com > qq.com > youku.com > a.com (central domain)

Finally, all four access paths of the foregoing source IP are summarized in the data summarization step. The summarization diagram is shown in FIG. 2(b).

The summary of the central domain is four a.com.

The summary of the Source Domain 1 is two qq.com, one baidu.com, and one youku.com.

The summary of the Source Domain 2 is two sina.com, one baidu.com, and one qq.com.

The summary of the Source Domain 3 is two baidu.com, one sina.com, and one youku.com.

From the visualization shown in FIG. 2(b), it can be clearly seen which domains users accessed in the step immediately before reaching the central domain a.com, which domains they accessed in the step before that, and so on.

When all source IPs are processed according to this logic, the source and destination of the entire Internet traffic can be seen.
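The per-layer counts of the four example paths can be reproduced with a short sketch. The function `summarize_sources` is hypothetical; layer 0 stands for the central domain and layer k for Source Domain k. Every path is assumed to contain the central domain.

```python
from collections import Counter, defaultdict
from typing import Dict, List


def summarize_sources(paths: List[List[str]], central: str) -> Dict[int, Counter]:
    """Count, per layer, the domains that precede the central domain."""
    layers: Dict[int, Counter] = defaultdict(Counter)
    for path in paths:
        index = path.index(central)  # assumes the central domain is present
        layers[0][central] += 1
        # Walk backwards: layer 1 is the domain just before the central one.
        for layer, domain in enumerate(reversed(path[:index]), start=1):
            layers[layer][domain] += 1
    return dict(layers)


paths = [
    ["baidu.com", "sina.com", "qq.com", "a.com"],
    ["sina.com", "baidu.com", "qq.com", "a.com"],
    ["youku.com", "sina.com", "baidu.com", "a.com"],
    ["baidu.com", "qq.com", "youku.com", "a.com"],
]
summary = summarize_sources(paths, "a.com")
for layer in sorted(summary):
    print(layer, dict(summary[layer]))
```

The destination side could be counted the same way by walking forward from the central domain instead of backwards.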

By means of the analysis method of the disclosure, the source and the destination of the Internet traffic can be determined based on the central domain name to be analyzed, so that analysis and optimization of the traffic of the central domain name website can be better facilitated. Furthermore, with complete knowledge of the flow direction of the entire Internet traffic, the traffic state of other websites can be analyzed and understood from an overall perspective.

The above-described aspects are only the preferred embodiments of the disclosure and are not intended to limit the scope of the disclosure. Any equivalent variations and modifications made according to the claims of the disclosure should fall within the technical scope of the disclosure.

Claims

1. A method for analyzing a source and a destination of Internet traffic, which obtains the source and the destination of the Internet traffic by processing a DNS log, the method comprising the following steps: a log filtering step of filtering DNS logs that cannot reflect a real access path of a user; a log segmentation step of sequentially segmenting, according to a source IP, a time stamp difference and a central domain, the DNS logs obtained after the log filtering step to obtain segmented access paths; and a data summarization step of summarizing all the segmented access paths.

2. The analysis method according to claim 1, wherein by setting a black list and a white list in the log filtering step, DNS logs containing domain name requests of significant interest are retained, and DNS logs containing non-human domain name requests generated by a server are removed.

3. The analysis method according to claim 2, wherein the removal of the DNS logs further includes removing logs accessed by an enterprise IP and logs where the IP is not resolved.

4. The analysis method according to claim 3, wherein the DNS log segmentation based on the source IP is to obtain continuous DNS logs with the same source IP over a period of time.

5. The analysis method according to claim 4, wherein the segmentation of logs based on the time stamp difference is to segment, based on the time stamp difference in DNS logs, the logs after being segmented based on the source IP, and if the time stamp difference in two DNS logs is longer than a specified time length, the two DNS logs are split up.

6. The analysis method according to claim 5, wherein the specified time length is three seconds.

7. The analysis method according to claim 6, further comprising, after the step of segmenting the DNS logs based on the time stamp difference, a merging step of converting the domain name in the access paths obtained by the segmentation into a domain, and merging continuous identical domains, so as to obtain a path of the source IP.

8. The analysis method according to claim 7, wherein the segmentation based on the central domain is to segment the path of the source IP based on the central domain, the access path obtained after the segmentation being: source domain name n+...+source domain name 1+central domain name+destination domain name 1+...+destination domain name n, and the central domain is a domain to be mainly analyzed based on user/system requirements.

9. The analysis method according to claim 8, wherein all the access paths of the source IP, which are obtained after the segmentation step based on the central domain, are summarized in the data summarization step.