WO2017177591A1 - Method for analyzing source and destination of internet traffic - Google Patents

Publication number
WO2017177591A1
Authority
WO
WIPO (PCT)
Prior art keywords
log
source
dns
domain
domain name
Prior art date
Application number
PCT/CN2016/095672
Other languages
French (fr)
Chinese (zh)
Inventor
张大顺
Original Assignee
上海牙木通讯技术有限公司
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 上海牙木通讯技术有限公司
Priority to RU2018139991A (RU2702048C1)
Priority to GB1816212.3A (GB2564057A)
Priority to JP2018554481A (JP7075348B2)
Publication of WO2017177591A1

Classifications

    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04L TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L61/00 Network arrangements, protocols or services for addressing or naming
    • H04L61/45 Network directories; Name-to-address mapping
    • H04L61/4505 Network directories; Name-to-address mapping using standardised directories; using standardised directory access protocols
    • H04L61/4511 Network directories; Name-to-address mapping using standardised directories; using standardised directory access protocols using domain name system [DNS]
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04L TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L63/00 Network architectures or network communication protocols for network security
    • H04L63/02 Network architectures or network communication protocols for network security for separating internal from external traffic, e.g. firewalls
    • H04L63/0227 Filtering policies

Abstract

The present invention provides a method for analyzing the source and destination of Internet traffic. In the method, the source and destination of Internet traffic are obtained by processing DNS logs. The method comprises the following steps: a log filtering step of filtering out DNS logs that cannot reflect the real access path of a user; a log segmentation step of segmenting the DNS logs obtained after the log filtering step by source IP, by timestamp difference, and by central domain, in that order, to obtain segmented access paths; and a data summarization step of summarizing all the segmented access paths. With the analysis method of the present invention, the source and destination of Internet traffic can be determined, which helps a website analyze and optimize its traffic. Furthermore, with complete knowledge of how traffic flows across the whole Internet, the traffic of other websites can be analyzed and understood from a global perspective.

Description

Method for analyzing the source and destination of Internet traffic
Technical Field
The present invention relates to the field of Internet DNS domain name resolution, and in particular to a method for analyzing the source and destination of Internet traffic.
Background
The source and destination of Internet traffic refers to a user's website access path: which website the user visited first, which website the user went to next, and so on. The only mainstream way in the industry to determine a website's traffic sources is to add JavaScript monitoring code to the site's pages, most commonly through third-party tools such as Google Analytics and Baidu Tongji (Baidu Statistics).
This statistical approach is severely limited. Each website can learn only the site a visitor came from immediately beforehand; it cannot learn the earlier sites the visitor went through, let alone where the visitor goes after leaving. DNS (Domain Name System) is a distributed database on the Internet that maps domain names and IP addresses to each other, so that users can access the Internet without memorizing the numeric IP strings that machines read directly. DNS domain name resolution works as follows: to visit a website, the user enters its domain name in a browser. After the user presses Enter, the browser first issues a DNS request to obtain the IP address of the server for that domain name, and then sends an HTTP request to that IP address.
A DNS log records the response to each DNS request and therefore captures the domain names requested by almost all users. However, the log also contains a great deal of abnormal and invalid information. For example, servers also issue DNS requests and thereby generate large numbers of domain name records, and Internet crawlers and even network attacks generate large numbers of DNS requests. Such requests do not reflect users' real access paths.
At present there is no method on the market that analyzes a visitor's entire access path well. The present invention fills this gap: it reprocesses DNS logs to determine which websites a site's traffic comes from and which websites visitors go to after leaving.
Summary of the Invention
In view of the above shortcomings, the present invention proposes a method for analyzing the source and destination of Internet traffic. By cleaning non-human access behavior out of the logs as far as possible, the method can effectively obtain the source and destination of Internet traffic.
The method for analyzing the source and destination of Internet traffic according to the present invention obtains the source and destination of Internet traffic by processing DNS logs, and comprises the following steps:
a log filtering step of filtering out DNS logs that cannot reflect the real access path of a user; a log segmentation step of segmenting the DNS logs obtained after the log filtering step by source IP, by timestamp difference, and by central domain, in that order, to obtain segmented access paths; and a data summarization step of summarizing all the segmented access paths.
Preferably, the log filtering step uses a blacklist and a whitelist to retain DNS logs containing requests for domain names of particular interest and to remove DNS logs containing non-human domain name requests generated by servers.
Preferably, removing DNS logs further comprises removing logs of access from enterprise IPs and removing logs that have no resolved IP.
Preferably, segmenting the DNS logs by source IP means obtaining the consecutive DNS logs of the same source IP within a period of time.
Preferably, segmenting by timestamp difference means further segmenting the logs already segmented by source IP according to the difference between the timestamps of the DNS logs: if the difference between the timestamps of two DNS logs is greater than a specified length of time, the two DNS logs are separated.
Preferably, the specified length of time is 3 seconds.
Preferably, the step of segmenting the DNS logs by timestamp difference is followed by a merging step, in which the domain names in the access paths obtained by segmentation are converted into domains and consecutive identical domains are merged to obtain the path of the source IP.
Preferably, segmenting by central domain means segmenting the path of the source IP with the central domain as the reference, and the access path obtained after segmentation is: source domain name n + … + source domain name 1 + central domain name + destination domain name 1 + … + destination domain name n, where the central domain is the domain selected for focused analysis according to user or system requirements.
Preferably, in the data summarization step, all access paths of the source IP obtained after the central domain segmentation step are summarized.
With the analysis method of the present invention, the source and destination of Internet traffic can be determined, which helps a website analyze and optimize its traffic. Furthermore, with complete knowledge of how traffic flows across the whole Internet, the traffic of other websites can be analyzed and understood from a global perspective, so that a site owner knows both its own site and its competitors.
Brief Description of the Drawings
Fig. 1(a) and Fig. 1(b) are flowcharts of the method for analyzing the source and destination of Internet traffic according to the present invention;
Fig. 2(a) and Fig. 2(b) are schematic diagrams of the traffic sources obtained by the method for analyzing the source and destination of Internet traffic according to the present invention.
Detailed Description
The invention is described in detail below with reference to the drawings and embodiments. The following embodiments do not limit the invention. Variations and advantages that those skilled in the art can conceive of without departing from the spirit and scope of the inventive concept are included in the invention.
As mentioned above, DNS (Domain Name System) is a distributed database on the Internet that maps domain names and IP addresses to each other, so that users can access the Internet without memorizing the numeric IP strings that machines read directly. To visit a website, the user first enters its domain name in a browser. After the user presses Enter, the browser issues a DNS request to obtain the IP address of the server for that domain name, and then sends an HTTP request to that IP address. This is DNS domain name resolution.
DNS logs are generated during this resolution process. A DNS log records the response to each DNS request and captures the domain names requested by almost all users. The format of a DNS log is as follows:
14.***.***.10|www.baidu.com|20141211035932|180.***.***.107;180.***.***.108|0
Source IP|Domain name|Timestamp|Resolved IP|Status code
That is, a DNS log consists of five parts: "source IP", "domain name", "timestamp", "resolved IP", and "status code".
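The five-field format above could be parsed along the following lines. This is a minimal sketch, not the patent's implementation: the names `DnsRecord` and `parse_dns_log_line` are hypothetical, and it assumes fields are separated by `|`, resolved IPs by `;`, and timestamps written as `YYYYMMDDhhmmss`, as in the sample line.

```python
from datetime import datetime
from typing import NamedTuple, Tuple


class DnsRecord(NamedTuple):
    """One DNS log entry (hypothetical container, not named in the source)."""
    source_ip: str
    domain: str
    timestamp: datetime
    resolved_ips: Tuple[str, ...]
    status_code: str


def parse_dns_log_line(line: str) -> DnsRecord:
    # Assumed layout: source IP|domain name|timestamp|resolved IPs|status code.
    # A line with a different number of fields raises ValueError here.
    source_ip, domain, stamp, resolved, status = line.strip().split("|")
    # Several resolved IPs are separated by ";"; an empty field means none.
    resolved_ips = tuple(ip for ip in resolved.split(";") if ip)
    timestamp = datetime.strptime(stamp, "%Y%m%d%H%M%S")
    return DnsRecord(source_ip, domain, timestamp, resolved_ips, status)


record = parse_dns_log_line(
    "14.***.***.10|www.baidu.com|20141211035932|180.***.***.107;180.***.***.108|0"
)
print(record.domain, record.timestamp.isoformat())
```

Keeping the resolved IPs as a tuple makes it easy for a later filtering step to test whether a record has any resolved IP at all.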
Because DNS logs contain the domain names requested by all users, the inventor conceived of reprocessing DNS logs to analyze the source and destination of a website's traffic. However, DNS logs also contain a great deal of abnormal and invalid information. For example, servers also issue DNS requests and thereby generate large numbers of domain name records, and Internet crawlers and even network attacks generate large numbers of DNS requests. Such requests do not reflect users' real access paths. The inventor therefore conceived of obtaining the source and destination of Internet traffic effectively by cleaning non-human access behavior out of the logs as far as possible.
Fig. 1 is a flowchart of the method for analyzing the source and destination of Internet traffic according to the present invention. As shown in Fig. 1, the method comprises the following steps.
First, DNS logs that cannot reflect the real access path of a user are filtered out (step S1). As noted above, DNS requests include many domain names that do not reflect users' real access paths, so the logs must be cleaned. For example, a blacklist and a whitelist are set up: the blacklist removes non-human domain name requests generated by servers, and the whitelist retains certain domain names of particular interest. The whitelist takes priority over the blacklist. In addition, logs of access from enterprise IPs and logs without a resolved IP are removed. Enterprise IPs are removed because an enterprise IP produces access logs from many people at the same time, which interferes with determining an individual's access trajectory; logs without a resolved IP are removed because they correspond to failed accesses. Filtering the logs along these different dimensions yields DNS logs that reflect users' real access paths.
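The filtering rules of step S1 can be sketched as follows. This is an illustration under stated assumptions rather than the patented implementation: the function names, the dictionary-based record layout, exact-match list lookups, and the `example.net` and `203.0.113.x` values are all hypothetical. The text says only that the whitelist outranks the blacklist; checking the enterprise-IP and missing-resolved-IP rules before the whitelist is a further assumption.

```python
def keep_record(record, whitelist, blacklist, enterprise_ips):
    """Return True if a DNS log record should survive filtering."""
    # A record without any resolved IP corresponds to a failed access.
    if not record["resolved_ips"]:
        return False
    # Enterprise IPs mix many people's requests, so they are dropped.
    if record["source_ip"] in enterprise_ips:
        return False
    # The whitelist takes priority over the blacklist.
    if record["domain"] in whitelist:
        return True
    return record["domain"] not in blacklist


def filter_logs(records, whitelist=frozenset(), blacklist=frozenset(),
                enterprise_ips=frozenset()):
    return [r for r in records
            if keep_record(r, whitelist, blacklist, enterprise_ips)]


# Hypothetical sample records for illustration only.
records = [
    {"source_ip": "1.1.1.1", "domain": "www.baidu.com", "resolved_ips": ["203.0.113.7"]},
    {"source_ip": "1.1.1.1", "domain": "ntp.example.net", "resolved_ips": ["203.0.113.8"]},
    {"source_ip": "1.1.1.1", "domain": "www.qq.com", "resolved_ips": []},
    {"source_ip": "3.3.3.3", "domain": "www.sina.com", "resolved_ips": ["203.0.113.9"]},
    {"source_ip": "2.2.2.2", "domain": "cdn.example.net", "resolved_ips": ["203.0.113.10"]},
]
kept = filter_logs(
    records,
    whitelist={"cdn.example.net"},
    blacklist={"ntp.example.net", "cdn.example.net"},
    enterprise_ips={"3.3.3.3"},
)
print([r["domain"] for r in kept])
```

In this sample, `cdn.example.net` appears on both lists and is kept, which is the only case where the whitelist's priority changes the outcome.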
Next, the DNS logs obtained after the log filtering step are segmented by source IP, by timestamp difference, and by central domain, in that order, to obtain the segmented domains (step S2).
The detailed steps are as follows:
1) Segmentation by source IP (step S21). Segmenting the DNS logs by source IP means obtaining the consecutive DNS logs of the same source IP within a period of time.
For example, source IP 1.1.1.1 and source IP 2.2.2.2 are different source IPs, so the logs are split between them, as shown below:
Source IP|Domain name|Timestamp|Resolved IP|Status code
1.1.1.1|www.baidu.com|20141211035932|180.***.***.107;180.***.***.108|0
1.1.1.1|www.qq.com|20141211035932|180.***.***.107;180.***.***.108|0
--------------------------------------- log split line ---------------------------------------
2.2.2.2|www.baidu.com|20141211035932|180.***.***.107;180.***.***.108|0
2.2.2.2|www.qq.com|20141211035932|180.***.***.107;180.***.***.108|0
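Segmentation by source IP amounts to grouping records by their first field. The sketch below is one possible reading, with a hypothetical function name and records reduced to `(source_ip, domain, timestamp)` tuples. Sorting each group by timestamp is an assumption added so that later steps see the records in time order.

```python
from collections import defaultdict


def split_by_source_ip(records):
    """Group (source_ip, domain, timestamp) tuples by source IP."""
    groups = defaultdict(list)
    for record in records:
        groups[record[0]].append(record)
    for group in groups.values():
        # Fixed-width YYYYMMDDhhmmss strings sort chronologically as text.
        # list.sort is stable, so equal timestamps keep their log order.
        group.sort(key=lambda r: r[2])
    return dict(groups)


logs = [
    ("1.1.1.1", "www.baidu.com", "20141211035932"),
    ("2.2.2.2", "www.baidu.com", "20141211035932"),
    ("1.1.1.1", "www.qq.com", "20141211035932"),
    ("2.2.2.2", "www.qq.com", "20141211035932"),
]
by_ip = split_by_source_ip(logs)
for ip, group in by_ip.items():
    print(ip, [domain for _, domain, _ in group])
```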
2) Next, the logs segmented by source IP are further segmented by timestamp difference (step S22). Segmenting by timestamp difference means further segmenting the logs already segmented by source IP according to the difference between the timestamps of the DNS logs. If the difference between the timestamps of two DNS logs is greater than a specified length of time, the two logs are separated, because logs too far apart in time are regarded as two different behaviors. The specified length of time can be adjusted as needed. In this embodiment it is 3 seconds, so logs whose timestamps are more than 3 seconds apart are separated.
For example, the DNS logs of source IP 2.2.2.2 are further segmented by timestamp difference, as shown below. (The timestamp 20141211035932 denotes 03:59:32 on December 11, 2014.)
Source IP|Domain name|Timestamp|Resolved IP|Status code
2.2.2.2|www.baidu.com|20141211000001|180.***.***.107;180.***.***.108|0
2.2.2.2|a.qq.com|20141211000002|180.***.***.107;180.***.***.108|0
2.2.2.2|b.baidu.com|20141211000003|180.***.***.107;180.***.***.108|0
2.2.2.2|c.tanx.com|20141211000004|180.***.***.107;180.***.***.108|0
2.2.2.2|c.allyes.com|20141211000005|180.***.***.107;180.***.***.108|0
--------------------------------------- log split line ---------------------------------------
2.2.2.2|www.sina.com|20141211000009|180.***.***.107;180.***.***.108|0
--------------------------------------- log split line ---------------------------------------
2.2.2.2|www.qq.com|20141211000015|180.***.***.107;180.***.***.108|0
--------------------------------------- log split line ---------------------------------------
2.2.2.2|www.qq.com|20141211000019|180.***.***.107;180.***.***.108|0
--------------------------------------- log split line ---------------------------------------
2.2.2.2|www.a.com|20141211000024|180.***.***.107;180.***.***.108|0
--------------------------------------- log split line ---------------------------------------
2.2.2.2|www.b.com|20141211000029|180.***.***.107;180.***.***.108|0
As shown above, the timestamps 20141211000005 (05 seconds) and 20141211000009 (09 seconds) differ by 4 seconds, which is greater than 3 seconds, so the log is split between them. The timestamps 20141211000009 and 20141211000015 differ by 6 seconds, so the log is split there as well.
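The timestamp rule can be sketched as a single pass over one source IP's records in time order. The function name is hypothetical, the 3-second threshold is the value given for this embodiment, and the sample timestamps are written in the 14-digit `YYYYMMDDhhmmss` form.

```python
from datetime import datetime, timedelta


def split_by_time_gap(records, max_gap=timedelta(seconds=3)):
    """Split time-ordered (domain, timestamp) pairs into segments.

    A new segment starts whenever the gap to the previous record
    is strictly greater than max_gap.
    """
    segments = []
    previous = None
    for domain, stamp in records:
        current = datetime.strptime(stamp, "%Y%m%d%H%M%S")
        if previous is None or current - previous > max_gap:
            segments.append([])
        segments[-1].append(domain)
        previous = current
    return segments


# Records of source IP 2.2.2.2 from the example above.
ip_records = [
    ("www.baidu.com", "20141211000001"),
    ("a.qq.com", "20141211000002"),
    ("b.baidu.com", "20141211000003"),
    ("c.tanx.com", "20141211000004"),
    ("c.allyes.com", "20141211000005"),
    ("www.sina.com", "20141211000009"),
    ("www.qq.com", "20141211000015"),
    ("www.qq.com", "20141211000019"),
    ("www.a.com", "20141211000024"),
    ("www.b.com", "20141211000029"),
]
segments = split_by_time_gap(ip_records)
for segment in segments:
    print(segment)
```

Parsing the timestamps into `datetime` values, rather than subtracting the raw integers, keeps the gap correct across minute, hour, and day boundaries.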
As shown above, the log is split into six segments. In the first segment, source IP 2.2.2.2 requested five domain names: www.baidu.com, a.qq.com, b.baidu.com, c.tanx.com, and c.allyes.com. The method for determining user access behavior shows that the user actually visited only www.baidu.com; the other four domain names are requests triggered as a side effect of the user opening www.baidu.com and are not real visits by the user. The first segment therefore yields a path in which the user visited the domain name www.baidu.com. The method for determining user access behavior works as follows: when a user clicks a URL, requests are made not only for the domain name of that URL but also for some other domain names. Crawler technology can capture all the other domain name requests that follow the request for the URL's domain name, and matching the crawled series of domain name requests against the domain name segments split out of the DNS log gives the correspondence between the DNS log and the domain name the user actually visited. This correspondence shows that the first segment reflects a visit to www.baidu.com. The second segment contains only www.sina.com, so www.sina.com is the domain name path the user visited.
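The source does not specify how the crawled requests are matched against a log segment, so the following is only one plausible sketch. It assumes a hypothetical mapping `crawled` from a page's domain name to the set of domain names a crawler observed being requested alongside it, and picks the candidate whose crawled set covers the most of the segment. Falling back to the first domain name when nothing matches is also an assumption.

```python
def identify_visited_domain(segment, crawled):
    """Guess which domain name in a segment the user actually visited.

    segment: list of domain names from one time-gap segment.
    crawled: dict mapping a page's domain name to the set of domain
             names a crawler saw requested when loading that page
             (hypothetical data, not provided by the source).
    """
    best, best_cover = None, 0
    for candidate in segment:
        if candidate not in crawled:
            continue
        expected = crawled[candidate] | {candidate}
        cover = sum(1 for domain in segment if domain in expected)
        if cover > best_cover:
            best, best_cover = candidate, cover
    # Assumption: with no crawl data, treat the first request as the visit.
    return best if best is not None else segment[0]


crawled = {
    "www.baidu.com": {"a.qq.com", "b.baidu.com", "c.tanx.com", "c.allyes.com"},
}
first_segment = ["www.baidu.com", "a.qq.com", "b.baidu.com", "c.tanx.com", "c.allyes.com"]
print(identify_visited_domain(first_segment, crawled))
print(identify_visited_domain(["www.sina.com"], crawled))
```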
Connecting the paths from the segments above gives:
www.baidu.com>www.sina.com>www.qq.com>www.qq.com>www.a.com>www.b.com
The path obtained by timestamp segmentation is then merged by identical domain. Here the merge is by second-level domain, and the result is:
baidu.com>sina.com>qq.com>a.com>b.com
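The merging step can be sketched in two parts: reduce each domain name to its domain, then collapse consecutive repeats. The helper names are hypothetical. Taking the last two labels is a simplifying assumption that fits the example but would mishandle names under multi-label suffixes such as `.com.cn`.

```python
from itertools import groupby


def to_domain(hostname):
    """Reduce a domain name to its last two labels, e.g. www.qq.com -> qq.com.

    Simplifying assumption: every registrable domain has exactly two labels.
    """
    return ".".join(hostname.lower().rstrip(".").split(".")[-2:])


def merge_consecutive(hostnames):
    """Convert domain names to domains and merge consecutive identical ones."""
    return [domain for domain, _ in groupby(to_domain(h) for h in hostnames)]


path = ["www.baidu.com", "www.sina.com", "www.qq.com",
        "www.qq.com", "www.a.com", "www.b.com"]
merged = merge_consecutive(path)
print(">".join(merged))
```

`itertools.groupby` groups only adjacent equal items, which is the behavior wanted here: a later return to an already visited domain stays in the path as a separate hop.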
This path is one of the paths among all the access behavior of the source IP. Applying the same rules gives all access paths of all source IPs.
3) Next, the above result is segmented again by central domain (step S23). The central domain is the domain selected for focused analysis according to user or system requirements: the analysis asks from which domains users arrived at the central domain and to which domains they went afterwards. For example, with a.com in the log as the central domain:
baidu.com>sina.com>qq.com>a.com>b.com
For example, below are four paths of the aforementioned source IP. Only the three layers of source domains before the central domain are shown for each path; paths after the central domain are processed with the same logic as paths before it. The actual number of layers can be adjusted as required. See also Fig. 2(a).
Source domain 3>Source domain 2>Source domain 1>Central domain
Path 1: baidu.com>sina.com>qq.com>a.com (central domain)
Path 2: sina.com>baidu.com>qq.com>a.com (central domain)
Path 3: youku.com>sina.com>baidu.com>a.com (central domain)
Path 4: baidu.com>qq.com>youku.com>a.com (central domain)
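Segmentation by central domain can be sketched as cutting a window around each occurrence of the central domain in a merged path. The function name and the `depth` parameter are hypothetical; `depth=3` mirrors the three source layers used in the example.

```python
def split_on_center(path, center, depth=3):
    """Return (sources, center, destinations) for each visit to the center.

    sources and destinations are in visiting order and hold at most
    `depth` domains each.
    """
    results = []
    for i, domain in enumerate(path):
        if domain == center:
            sources = path[max(0, i - depth):i]
            destinations = path[i + 1:i + 1 + depth]
            results.append((sources, center, destinations))
    return results


merged_path = ["baidu.com", "sina.com", "qq.com", "a.com", "b.com"]
windows = split_on_center(merged_path, "a.com")
for sources, center, destinations in windows:
    print(">".join(sources), "|", center, "|", ">".join(destinations))
```

In `sources`, the last element corresponds to "source domain 1" (the domain visited immediately before the central domain), and the first element of `destinations` corresponds to "destination domain 1".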
Finally, in the data summarization step, all four access paths of the aforementioned source IP are summarized. The summary is shown in Fig. 2(b).
The central domain is summarized as 4 × a.com.
Source domain 1 is summarized as 2 × qq.com, 1 × baidu.com, and 1 × youku.com.
Source domain 2 is summarized as 2 × sina.com, 1 × baidu.com, and 1 × qq.com.
Source domain 3 is summarized as 2 × baidu.com, 1 × sina.com, and 1 × youku.com.
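The per-layer totals can be reproduced with one counter per source layer. This is a sketch with a hypothetical function name. Each input path lists the source domains in visiting order with the central domain excluded, so layer 1 is the last element.

```python
from collections import Counter


def summarize_sources(source_paths, depth=3):
    """Count domains per source layer; index 0 is source domain 1."""
    layers = [Counter() for _ in range(depth)]
    for sources in source_paths:
        # Walk backwards from the domain nearest the central domain.
        for layer, domain in enumerate(reversed(sources[-depth:])):
            layers[layer][domain] += 1
    return layers


# The four example paths, without the central domain a.com.
source_paths = [
    ["baidu.com", "sina.com", "qq.com"],
    ["sina.com", "baidu.com", "qq.com"],
    ["youku.com", "sina.com", "baidu.com"],
    ["baidu.com", "qq.com", "youku.com"],
]
layers = summarize_sources(source_paths)
for number, counts in enumerate(layers, start=1):
    print(f"source domain {number}:", dict(counts))
```

The same function applied to destination lists, without the reversal, would give the corresponding destination-layer totals.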
A visualization such as Fig. 2(b) makes it clear which domains users visited in the step before reaching the central domain a.com, which domains they visited before those, and so on.
Once all source IPs have been processed with this logic, the traffic sources and destinations of the whole Internet can be seen.
With the above method of the present invention, the source and destination of Internet traffic for the central domain name under analysis can be determined, which helps the website of that central domain name analyze and optimize its traffic. Furthermore, with complete knowledge of how traffic flows across the whole Internet, the traffic of other websites can be analyzed and understood from a global perspective, so that a site owner knows both its own site and its competitors.
The above are merely preferred embodiments of the present invention and are not intended to limit its scope. All equivalent changes and modifications made in accordance with the claims of the present application fall within the technical scope of the present invention.

Claims (9)

  1. A method for analyzing the source and destination of Internet traffic, characterized in that the source and destination of Internet traffic are obtained by processing DNS logs, the method comprising the following steps:
    a log filtering step of filtering out DNS logs that cannot reflect the real access path of a user;
    a log segmentation step of segmenting the DNS logs obtained after the log filtering step by source IP, by timestamp difference, and by central domain, in that order, to obtain segmented access paths; and
    a data summarization step of summarizing all the segmented access paths.
  2. The analysis method according to claim 1, characterized in that the log filtering step uses a blacklist and a whitelist to retain DNS logs containing requests for domain names of particular interest and to remove DNS logs containing non-human domain name requests generated by servers.
  3. The analysis method according to claim 2, characterized in that removing DNS logs further comprises removing logs of access from enterprise IPs and removing logs that have no resolved IP.
  4. The analysis method according to claim 3, characterized in that segmenting the DNS logs by source IP means obtaining the consecutive DNS logs of the same source IP within a period of time.
  5. The analysis method according to claim 4, characterized in that segmenting by timestamp difference means further segmenting the logs already segmented by source IP according to the difference between the timestamps of the DNS logs, wherein if the difference between the timestamps of two DNS logs is greater than a specified length of time, the two DNS logs are separated.
  6. The analysis method according to claim 5, characterized in that the specified length of time is 3 seconds.
  7. The analysis method according to claim 6, characterized in that the step of segmenting the DNS logs by timestamp difference is followed by a merging step, in which the domain names in the access paths obtained by segmentation are converted into domains and consecutive identical domains are merged to obtain the path of the source IP.
  8. The analysis method according to claim 7, characterized in that segmenting by central domain means segmenting the path of the source IP with the central domain as the reference, and the access path obtained after segmentation is:
    source domain name n + … + source domain name 1 + central domain name + destination domain name 1 + … + destination domain name n,
    wherein the central domain is the domain selected for focused analysis according to user or system requirements.
  9. The analysis method according to claim 8, characterized in that, in the data summarization step, all access paths of the source IP obtained after the central domain segmentation step are summarized.
PCT/CN2016/095672 2016-04-14 2016-08-17 Method for analyzing source and destination of internet traffic WO2017177591A1 (en)

Priority Applications (3)

Application Number Priority Date Filing Date Title
RU2018139991A RU2702048C1 (en) 2016-04-14 2016-08-17 Method of analyzing a source and destination of internet traffic
GB1816212.3A GB2564057A (en) 2016-04-14 2016-08-17 Method for analyzing source and destination of internet traffic
JP2018554481A JP7075348B2 (en) 2016-04-14 2016-08-17 How to analyze the source and destination of Internet traffic

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN201610231212.X 2016-04-14
CN201610231212.XA CN105704260B (en) 2016-04-14 2016-04-14 A kind of analysis method of internet traffic source whereabouts

Publications (1)

Publication Number Publication Date
WO2017177591A1 (en) 2017-10-19

Family

ID=56216713

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2016/095672 WO2017177591A1 (en) 2016-04-14 2016-08-17 Method for analyzing source and destination of internet traffic

Country Status (5)

Country Link
JP (1) JP7075348B2 (en)
CN (1) CN105704260B (en)
GB (1) GB2564057A (en)
RU (1) RU2702048C1 (en)
WO (1) WO2017177591A1 (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US10834214B2 (en) 2018-09-04 2020-11-10 At&T Intellectual Property I, L.P. Separating intended and non-intended browsing traffic in browsing history

Families Citing this family (7)

Publication number Priority date Publication date Assignee Title
CN105704260B (en) * 2016-04-14 2019-05-21 上海牙木通讯技术有限公司 A kind of analysis method of internet traffic source whereabouts
CN105763633B (en) * 2016-04-14 2019-05-21 上海牙木通讯技术有限公司 A kind of correlating method of domain name and website visiting behavior
CN107846480B (en) * 2016-09-19 2021-04-20 贵州白山云科技股份有限公司 NXDOMAIN response packet processing method and device
CN107707545B (en) * 2017-09-29 2021-06-04 深信服科技股份有限公司 Abnormal webpage access fragment detection method, device, equipment and storage medium
CN109150819B (en) * 2018-01-15 2019-06-11 北京数安鑫云信息技术有限公司 A kind of attack recognition method and its identifying system
CN110138684B (en) * 2019-04-01 2022-04-29 贵州力创科技发展有限公司 Traffic monitoring method and system based on DNS log
CN111526065A (en) * 2020-04-13 2020-08-11 苏宁云计算有限公司 Website page flow analysis method and system

Citations (4)

Publication number Priority date Publication date Assignee Title
US20030188119A1 (en) * 2002-03-26 2003-10-02 Clark Lubbers System and method for dynamically managing memory allocated to logging in a storage area network
CN102004883A (en) * 2010-12-03 2011-04-06 中国软件与技术服务股份有限公司 Trace tracking method for electronic files
CN105357054A (en) * 2015-11-26 2016-02-24 上海晶赞科技发展有限公司 Website traffic analysis method and apparatus, and electronic equipment
CN105704260A (en) * 2016-04-14 2016-06-22 上海牙木通讯技术有限公司 Method for analyzing where Internet traffic comes from and goes to

Family Cites Families (4)

Publication number Priority date Publication date Assignee Title
EP1290853A2 (en) * 2000-05-26 2003-03-12 Akamai Technologies, Inc. Global load balancing across mirrored data centers
EP2245837B1 (en) * 2008-02-11 2011-12-28 Dolby Laboratories Licensing Corporation Dynamic DNS system for private networks
US8380870B2 (en) * 2009-08-05 2013-02-19 Verisign, Inc. Method and system for filtering of network traffic
RU105758U1 (en) * 2010-11-23 2011-06-20 Валентина Владимировна Глазкова ANALYSIS AND FILTRATION SYSTEM FOR INTERNET TRAFFIC BASED ON THE CLASSIFICATION METHODS OF MULTI-DIMENSIONAL DOCUMENTS

Cited By (3)

Publication number Priority date Publication date Assignee Title
US10834214B2 (en) 2018-09-04 2020-11-10 At&T Intellectual Property I, L.P. Separating intended and non-intended browsing traffic in browsing history
US11228655B2 (en) 2018-09-04 2022-01-18 At&T Intellectual Property I, L.P. Separating intended and non-intended browsing traffic in browsing history
US11652900B2 (en) 2018-09-04 2023-05-16 At&T Intellectual Property I, L.P. Separating intended and non-intended browsing traffic in browsing history

Also Published As

Publication number Publication date
GB2564057A (en) 2019-01-02
CN105704260A (en) 2016-06-22
JP7075348B2 (en) 2022-05-25
RU2702048C1 (en) 2019-10-03
CN105704260B (en) 2019-05-21
JP2019514303A (en) 2019-05-30

Legal Events

Date Code Title Description
ENP Entry into the national phase. Ref document number: 201816212; Country of ref document: GB; Kind code of ref document: A; Free format text: PCT FILING DATE = 20160817
WWE WIPO information: entry into national phase. Ref document number: 1816212.3; Country of ref document: GB
ENP Entry into the national phase. Ref document number: 2018554481; Country of ref document: JP; Kind code of ref document: A
NENP Non-entry into the national phase. Ref country code: DE
121 EP: the EPO has been informed by WIPO that EP was designated in this application. Ref document number: 16898406; Country of ref document: EP; Kind code of ref document: A1
122 EP: PCT application non-entry in European phase. Ref document number: 16898406; Country of ref document: EP; Kind code of ref document: A1