CN102262635A - Web crawler system and method - Google Patents

Web crawler system and method

Info

Publication number
CN102262635A
Authority
CN
Grant status
Application
Patent type
Application number
CN 201010189998
Other languages
Chinese (zh)
Inventor
李天武
肖小剑
Original Assignee
北京启明星辰信息安全技术有限公司
北京启明星辰信息技术股份有限公司

Abstract

The present invention discloses a web crawler system and method that address the inability of the prior art to extract dynamic URLs effectively. The method comprises: setting up a first deduplication queue; receiving a target page; crawling the target page with a static crawler; treating the uniform resource locators (URLs) in the target page that the static crawler cannot analyze as dynamic URLs; submitting the dynamic URLs to the first deduplication queue; and using a dynamic crawler to continue crawling the dynamic URLs in the first deduplication queue. The invention overcomes the prior art's inability to extract dynamic URLs effectively, improves the efficiency and performance of web search, and helps maintain the security of web applications.

Description

Web crawler system and method

Technical Field

[0001] The present invention relates to web search technology, and in particular to a web crawler system and method.

Background Art

[0002] A web crawler is a program that automatically fetches web pages. It downloads pages from the Internet for a search engine and is an important component of one. A traditional crawler starts from the uniform resource locators (URLs) of one or several initial pages and obtains the URLs on those pages. While fetching pages, it keeps extracting new URLs from the current page, places them in a queue, and continues the analysis. This cycle repeats until the entire Internet has been traversed or a stop condition of the system is met.

[0003] In terms of application scope, crawlers are used mainly in search engines such as Google and Baidu and in specialized vertical search engines (for example, job search engines). They are also used for collecting virus samples, network security monitoring, cloud security, and similar purposes.

[0004] Web pages can be divided into dynamic pages and static pages according to whether they contain scripts executed on the browser side. In a static page, URLs are embedded directly in the HTML file as HyperText Markup Language (HTML) hyperlinks; such URLs are generally called static URLs (or static links). A dynamic page contains, in addition to static URLs, a large number of dynamic URLs (or dynamic links) that can be obtained only by executing browser-side scripts. JavaScript is currently the dominant browser-side scripting language on the Internet.

[0005] A crawler that can extract only static URLs is generally called a static crawler, and a crawler that can extract dynamic URLs is called a dynamic crawler.

[0006] Static URLs can be extracted fairly easily by analyzing the HTML hyperlink tags of the page file. A dynamic URL, however, exists in the page file only as fragments of script code and may have no HTML tag at all, so analyzing hyperlink tags cannot yield the URL. This is the greatest shortcoming of static crawlers: they cannot obtain dynamic URLs.

[0007] In view of this, a web crawler technique that extracts dynamic URLs effectively is needed.

Summary of the Invention

[0008] The technical problem to be solved by the present invention is to provide a web crawler system and method that overcome the inability of the prior art to extract dynamic URLs effectively.

[0009] To solve the above technical problem, the present invention provides a web crawling method, comprising:

[0010] setting up a first deduplication queue;

[0011] receiving a target page;

[0012] crawling the target page with a static crawler;

[0013] treating the uniform resource locators (URLs) in the target page that the static crawler cannot analyze as dynamic URLs;

[0014] submitting the dynamic URLs to the first deduplication queue; and

[0015] using a dynamic crawler to continue crawling the dynamic URLs in the first deduplication queue.

[0016] Preferably, when the first deduplication queue is set up, a second deduplication queue is also set up;

[0017] when the target page is crawled with the static crawler, the static URLs in the target page are also obtained;

[0018] the static URLs are submitted to the second deduplication queue; and

[0019] the static crawler is further used to crawl the static URLs in the second deduplication queue.

[0020] Preferably, the step of using the dynamic crawler to continue crawling the dynamic URLs in the first deduplication queue comprises: submitting the dynamic URLs obtained to the first deduplication queue and the static URLs obtained to the second deduplication queue. The step of using the static crawler to continue crawling the static URLs in the second deduplication queue comprises: submitting the dynamic URLs obtained to the first deduplication queue and the static URLs obtained to the second deduplication queue.

[0021] Preferably, the method further comprises:

[0022] stopping the crawl when the dynamic URLs in the first deduplication queue and the static URLs in the second deduplication queue have all been crawled, or in accordance with a stop condition.

[0023] Preferably, the step of setting up the first deduplication queue and the second deduplication queue comprises:

[0024] setting up the first deduplication queue and the second deduplication queue by means of a database or an in-memory linked-list structure.

[0025] To solve the above technical problem, the present invention further provides a web crawler system, comprising:

[0026] a setting module, configured to set up a first deduplication queue;

[0027] a receiving module, configured to receive a target page;

[0028] a static crawler module, configured to crawl the target page with a static crawler;

[0029] a dynamic crawler module, configured to treat the uniform resource locators (URLs) in the target page that the static crawler cannot analyze as dynamic URLs, and further configured to use a dynamic crawler to continue crawling the dynamic URLs in the first deduplication queue; and

[0030] a submission module, configured to submit the dynamic URLs to the first deduplication queue.

[0031] Preferably, the setting module is further configured to set up a second deduplication queue;

[0032] the static crawler module is further configured to obtain the static URLs in the target page when crawling it with the static crawler, and to use the static crawler to crawl the static URLs in the second deduplication queue; and

[0033] the submission module is further configured to submit the static URLs to the second deduplication queue.

[0034] Preferably, the dynamic crawler module is configured to use the dynamic crawler to continue crawling the dynamic URLs in the first deduplication queue, with the dynamic URLs obtained submitted to the first deduplication queue and the static URLs obtained submitted to the second deduplication queue; the static crawler module is configured to use the static crawler to continue crawling the static URLs in the second deduplication queue, with the dynamic URLs obtained submitted to the first deduplication queue and the static URLs obtained submitted to the second deduplication queue.

[0035] Preferably, the system further comprises:

[0036] a stop module, configured to stop the crawl when the dynamic URLs in the first deduplication queue and the static URLs in the second deduplication queue have all been crawled, or in accordance with a stop condition.

[0037] Preferably, the setting module is configured to set up the first deduplication queue and the second deduplication queue by means of a database or an in-memory linked-list structure.

[0038] Compared with the prior art, one embodiment of the present invention overcomes the prior art's inability to extract dynamic URLs effectively. Another embodiment combines web search techniques for static URLs written in static scripting languages with web search techniques for dynamic URLs written in dynamic scripting languages. It can extract both the static URLs and the dynamic URLs in a web page and search them with static crawler technology and dynamic crawler technology respectively, which improves the efficiency and performance of web search and helps maintain the security of web applications.

[0039] Other features and advantages of the present invention will be set forth in the description that follows, and in part will become apparent from the description or be learned by practicing the invention. The objectives and other advantages of the invention can be realized and obtained through the structures particularly pointed out in the description, the claims, and the accompanying drawings.

Brief Description of the Drawings

[0040] The accompanying drawings are provided for a further understanding of the present invention and form part of the specification. Together with the embodiments, they serve to explain the invention and do not limit it. In the drawings:

[0041] FIG. 1 is a schematic diagram of the static crawler flow in the prior art;

[0042] FIG. 2 is a schematic diagram of the composition of a system embodiment of the present invention;

[0043] FIG. 3 is a schematic flowchart of a method embodiment of the present invention.

Detailed Description

[0044] Embodiments of the present invention are described in detail below with reference to the accompanying drawings and examples, so that how the invention applies technical means to solve the technical problem and achieve the technical effect can be fully understood and put into practice.

[0045] First, provided they do not conflict, the embodiments of the present invention and the features within them may be combined with one another, and all such combinations fall within the scope of protection of the invention. In addition, the steps shown in the flowcharts of the drawings may be executed in a computer system, for example as a set of computer-executable instructions. Although a logical order is shown in the flowcharts, in some cases the steps shown or described may be performed in a different order.

[0046] FIG. 1 is a schematic diagram of the static crawler flow in the prior art. As shown in FIG. 1, the static crawler flow mainly comprises the following steps:

[0047] Step S110: receive a web page URL and initialize the static crawler, including setting the number of threads, the crawl depth, the maximum URL length, and so on;

[0048] Step S120: parse the page at the URL into a DOM tree, obtain the URL tags that match target tags such as <a> and <img>, and add the attributes of those tags (such as href="..." and src="...") to a link array; the link array stores the URL link data contained in one page;

[0049] Step S130: add the link array to the deduplication queue. The deduplication queue is introduced to prevent a page from being crawled repeatedly, which would make the crawl of a website never finish. The deduplication queue is a linked-list queue structure: whenever a new URL is to be added, it is compared with the elements (URLs) already in the queue to see whether it is already present. If it is, the new URL is not enqueued; if it is not, the new URL is appended to the tail of the queue;

[0050] Step S140: determine whether the deduplication queue still contains unprocessed URLs; if so, go to step S150, otherwise go to step S160;

[0051] Step S150: fetch the next URL page in the deduplication queue and return to step S120;

[0052] Step S160: record the URL data in the deduplication queue and end.
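The flow of steps S110 to S160 can be sketched as follows. This is a minimal illustration in Python under stated assumptions, not the implementation described in the patent: `LinkExtractor`, `static_crawl`, the `fetch` callable, and the `max_pages` limit are hypothetical names introduced here, and pages come from an in-memory dictionary instead of the network.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin

# Tags and the attribute that carries a URL (step S120 names <a> and <img>).
URL_ATTRS = {"a": "href", "img": "src"}


class LinkExtractor(HTMLParser):
    """Collects the URL-bearing attributes of one page into a link array."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        wanted = URL_ATTRS.get(tag)
        for name, value in attrs:
            if name == wanted and value:
                self.links.append(urljoin(self.base_url, value))


def static_crawl(start_url, fetch, max_pages=100):
    """Hypothetical helper: returns every URL recorded in the dedup queue."""
    queue = [start_url]      # dedup queue: ordered, no repeated entries
    position = 0             # index of the next unprocessed URL
    while position < len(queue) and position < max_pages:   # S140
        url = queue[position]                                # S150
        position += 1
        extractor = LinkExtractor(url)
        extractor.feed(fetch(url))                           # S120
        for link in extractor.links:                         # S130
            if link not in queue:        # compare with existing elements
                queue.append(link)       # new URL goes to the tail
    return queue                                             # S160


# Toy site used in place of real HTTP requests (an assumption of this sketch).
PAGES = {
    "http://example.test/": '<a href="/a">A</a><img src="/logo.png">',
    "http://example.test/a": '<a href="/">home</a><a href="/b">B</a>',
}
crawled = static_crawl("http://example.test/", lambda u: PAGES.get(u, ""))
```

The linear `link not in queue` check mirrors the element-by-element comparison described in step S130; the link back to the home page is found but not enqueued a second time, which is what keeps the loop from running forever.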

[0053] The core idea of the present invention is to parse the target page to be analyzed, treating the URLs in the target page that the static crawler cannot analyze (for example, URLs written in a dynamic scripting language such as JavaScript) as dynamic URLs, and obtaining the static URLs produced by HTML hyperlink tags. When the static crawler, while analyzing a page, encounters a URL it cannot extract (that is, a dynamic URL), it submits the current page (the page the static crawler is currently analyzing) to the dynamic crawler, continues analyzing the current page, extracts the static URLs it can analyze by itself, and marks those static URLs. When the dynamic crawler reaches the current page, it simulates user operations such as clicking and entering characters to analyze the URLs that the static crawler could not (the unmarked URLs in the current page), thereby obtaining the corresponding dynamic URLs.

[0054] A dynamic crawler operation is a crawl operation performed on dynamic URLs written in a dynamic scripting language; a static crawler operation is a crawl operation performed on static URLs produced by HTML hyperlink tags. A dynamic crawl operation includes, for example, having a DOM control recognizer simulate a user's click (on a button control), text input (into a text box control), or selection (in a select control), or click a hyperlink generated by JavaScript (such as <a href="javascript...">), and obtaining the URL that results from the operation. The main operation of a static crawler is, for example, to find the tags that may contain hyperlinks (such as a, img, and iframe), extract their attributes (such as href="..." and src="..."), and add them to the deduplication queue.
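As a rough illustration of the split described above, the sketch below separates the links a static crawler can read directly from the elements that would have to be handed to a dynamic crawler. The class name `LinkClassifier`, the `CONTROL_TAGS` set, and the use of an `onclick` attribute as a marker are assumptions made for this example; the patent does not specify how the static crawler recognizes what it cannot analyze, and the sketch does not execute any script.

```python
from html.parser import HTMLParser

# Tags whose attribute holds a URL the static crawler can read directly.
STATIC_ATTRS = {"a": "href", "img": "src", "iframe": "src"}
# Controls that only yield URLs when a user action is simulated (assumed set).
CONTROL_TAGS = {"button", "input", "select"}


class LinkClassifier(HTMLParser):
    def __init__(self):
        super().__init__()
        self.static_urls = []     # readable from the attribute value
        self.needs_dynamic = []   # tags left for the dynamic crawler

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        value = (attrs.get(STATIC_ATTRS.get(tag, "")) or "").strip()
        is_script_link = value.lower().startswith("javascript:")
        if value and not is_script_link:
            self.static_urls.append(value)
        if is_script_link or tag in CONTROL_TAGS or "onclick" in attrs:
            self.needs_dynamic.append(tag)


classifier = LinkClassifier()
classifier.feed(
    '<a href="/news">news</a>'
    '<a href="javascript:go(1)">next</a>'
    '<img src="/x.png">'
    '<button onclick="load()">more</button>'
    '<select name="s"></select>'
)
```

A page for which `needs_dynamic` is non-empty is the kind of page that, in the patent's terms, the static crawler would submit to the dynamic crawler while still extracting the static URLs itself.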

[0055] FIG. 2 is a schematic diagram of the composition of a system embodiment of the present invention. As shown in FIG. 2, the system embodiment mainly comprises a setting module 210, a receiving module 220, a static crawling module 230, a dynamic crawling module 240, a submission module 250, and a stop module 260, wherein:

[0056] the setting module 210 is configured to set up a first deduplication queue and a second deduplication queue;

[0057] the receiving module 220 is configured to receive a target page;

[0058] the static crawling module 230, connected to the setting module 210 and the receiving module 220, is configured to crawl the target page with a static crawler and obtain the static uniform resource locators (URLs) produced by HTML hyperlink tags in the target page, and is further configured to use the static crawler to crawl the static URLs in the second deduplication queue and obtain static URLs and dynamic URLs;

[0059] the dynamic crawling module 240, connected to the setting module 210 and the receiving module 220, is configured to crawl the target page with a dynamic crawler and obtain dynamic URLs, and is further configured to use the dynamic crawler to crawl the dynamic URLs in the first deduplication queue and obtain static URLs and dynamic URLs; if the target page contains a URL that the static crawling module 230 cannot crawl, the target page contains a dynamic URL written in a dynamic scripting language;

[0060] the submission module 250, connected to the setting module 210, the static crawling module 230, and the dynamic crawling module 240, is configured to submit the static URLs to the second deduplication queue and the dynamic URLs to the first deduplication queue;

[0061] the stop module 260, connected to the static crawling module 230 and the dynamic crawling module 240, is configured to stop the crawl when the static URLs in the second deduplication queue and the dynamic URLs in the first deduplication queue have all been crawled, or in accordance with a stop condition.

[0062] The submission module 250 may be an independent module in the system embodiment. In other embodiments of the present invention, it may instead be integrated as a component into the static crawling module 230 and the dynamic crawling module 240 respectively.

[0063] The static crawling module 230 performs crawl operations in the order in which the submission module 250 submitted the static URLs to the second deduplication queue. That is, a static URL that entered the second deduplication queue earlier is crawled earlier, and one that entered later is crawled later.

[0064] The dynamic crawling module 240 performs crawl operations in the order in which the submission module 250 submitted the dynamic URLs to the first deduplication queue. That is, a dynamic URL that entered the first deduplication queue earlier is crawled earlier, and one that entered later is crawled later.

[0065] When submitting a static URL, the submission module 250 determines whether the same static URL already exists in the second deduplication queue; if so, it does not submit it. Likewise, when submitting a dynamic URL, the submission module 250 determines whether the same dynamic URL already exists in the first deduplication queue; if so, it does not submit it.
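The two behaviours described in paragraphs [0063] to [0065], first-in-first-out crawl order and rejection of duplicates on submission, can be captured in a small in-memory structure. The class below is a hypothetical sketch: the name `DedupQueue` and its method names are introduced here, and the auxiliary `set` is an optimisation of this example that the patent does not mention.

```python
class DedupQueue:
    """FIFO queue that refuses URLs it has already accepted."""

    def __init__(self):
        self._urls = []      # insertion order is the crawl order
        self._seen = set()   # fast membership test (assumption of this sketch)
        self._next = 0       # index of the first URL not yet crawled

    def submit(self, url):
        """Append url at the tail unless the same URL is already present."""
        if url in self._seen:
            return False
        self._seen.add(url)
        self._urls.append(url)
        return True

    def next_uncrawled(self):
        """Return the oldest uncrawled URL, or None when all are crawled."""
        if self._next >= len(self._urls):
            return None
        url = self._urls[self._next]
        self._next += 1
        return url


queue = DedupQueue()
accepted = [queue.submit(u) for u in ("/a", "/b", "/a")]
served = [queue.next_uncrawled() for _ in range(3)]
```

Crawled URLs stay in the structure instead of being removed, so a URL that was already crawled is still rejected if it is found again later.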

[0066] In practice, the static crawl operations and the dynamic crawl operations generally run at the same time.

[0067] FIG. 3 is a schematic flowchart of a method embodiment of the present invention. With reference to the system embodiment shown in FIG. 2, the method embodiment shown in FIG. 3 mainly comprises the following steps:

[0068] Step S310: set up a first deduplication queue and a second deduplication queue;

[0069] Step S320: receive a target page;

[0070] Step S330: crawl the target page with a static crawler and obtain the static uniform resource locators (URLs) produced by HTML hyperlink tags in the target page;

[0071] Step S340: submit the static URLs to the second deduplication queue;

[0072] Step S350: if, during step S320, the static crawler encounters a URL it cannot crawl, the target page contains a dynamic URL written in a dynamic scripting language; in that case, crawl the target page with a dynamic crawler and obtain the dynamic URLs in the target page;

[0073] Step S360: submit the dynamic URLs to the first deduplication queue;

[0074] Step S370: using the static crawler and the dynamic crawler respectively, continue crawling the static URLs in the second deduplication queue, submitting the static URLs and dynamic URLs obtained to the second deduplication queue and the first deduplication queue respectively, and crawl the dynamic URLs in the first deduplication queue, submitting the static URLs and dynamic URLs obtained to the second deduplication queue and the first deduplication queue respectively, until the static URLs in the second deduplication queue and the dynamic URLs in the first deduplication queue have all been crawled or the crawl is stopped in accordance with a stop condition.

[0075] The second deduplication queue and the first deduplication queue set up in step S310 may be implemented with a database or with an in-memory linked-list structure. A deduplication queue built on an in-memory linked list performs better than one built on a database, because data operations in memory are faster than data operations on a database, and because memory is local whereas the database may be local or remote. An in-memory deduplication queue with a reasonable algorithm therefore performs somewhat better than a database deduplication queue.

[0076] A database-backed deduplication queue simply submits each newly parsed URL to a database table: if the URL already exists in the database, it is not submitted; otherwise it is appended to the end of the table. After a URL has been crawled, the first uncrawled record in the database table is taken as the next URL to crawl.
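A database-backed queue of this kind might look like the sketch below, which uses SQLite from the Python standard library purely for illustration. The patent does not name a database product or a table layout; the class `DbDedupQueue`, the table name `url_queue`, and the `crawled` column are all assumptions of this example.

```python
import sqlite3


class DbDedupQueue:
    """Dedup queue stored in one table; rows are served in insertion order."""

    def __init__(self, conn, table="url_queue"):
        # The table name is fixed by the caller, never taken from page content.
        self.conn = conn
        self.table = table
        conn.execute(
            f"CREATE TABLE IF NOT EXISTS {table} ("
            "id INTEGER PRIMARY KEY AUTOINCREMENT, "
            "url TEXT UNIQUE NOT NULL, "
            "crawled INTEGER NOT NULL DEFAULT 0)"
        )

    def submit(self, url):
        """Insert url at the end of the table unless it already exists."""
        exists = self.conn.execute(
            f"SELECT 1 FROM {self.table} WHERE url = ?", (url,)
        ).fetchone()
        if exists:
            return False
        self.conn.execute(f"INSERT INTO {self.table} (url) VALUES (?)", (url,))
        return True

    def next_uncrawled(self):
        """Return the first uncrawled record and mark it as crawled."""
        row = self.conn.execute(
            f"SELECT id, url FROM {self.table} "
            "WHERE crawled = 0 ORDER BY id LIMIT 1"
        ).fetchone()
        if row is None:
            return None
        self.conn.execute(
            f"UPDATE {self.table} SET crawled = 1 WHERE id = ?", (row[0],)
        )
        return row[1]


db_queue = DbDedupQueue(sqlite3.connect(":memory:"))
db_accepted = [db_queue.submit(u) for u in ("/a", "/b", "/a")]
db_served = [db_queue.next_uncrawled() for _ in range(3)]
```

Every submission costs at least one query, which is consistent with the performance comparison in paragraph [0075]: the same operations on an in-memory structure avoid that round trip.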

[0077] A deduplication queue backed by an in-memory linked list simply submits each newly parsed URL to the linked list: if the URL already exists in the list, it is not submitted; otherwise it is appended to the tail of the list. After a URL has been crawled, the URL in the first uncrawled node of the linked list is taken as the next URL to crawl.

[0078] Steps S340 and S350 have no strict order of execution, and the static crawler and the dynamic crawler may crawl at the same time.

[0079] A simple example illustrates the system embodiment shown in FIG. 2 and the method embodiment shown in FIG. 3.

[0080] Suppose page P contains four URLs, a, b, c, and d, where:

[0081] a and b are dynamic URLs generated by a dynamic scripting language, and a contains two URLs, e and f; e is a dynamic URL generated by a dynamic scripting language, and f is a static URL generated by a static scripting language;

[0082] c and d are static URLs generated by a static scripting language, and c contains two URLs, g and h; g is a dynamic URL generated by a dynamic scripting language, and h is a static URL generated by a static scripting language.

[0083] After the dynamic crawl operation and the static crawl operation are started, page P is received and parsed, yielding the static URLs in page P (c and d) and the dynamic URLs (a and b). c and d (static URLs) are submitted to the second deduplication queue, and a and b (dynamic URLs) are submitted to the first deduplication queue. When the static crawler performs a static crawl operation on c, it obtains g (a dynamic URL) and h (a static URL); g is submitted to the first deduplication queue and h to the second deduplication queue. When the dynamic crawler performs a dynamic crawl operation on a, it obtains e (a dynamic URL) and f (a static URL); e is submitted to the first deduplication queue and f to the second deduplication queue.

[0084] Note that if e is submitted to the first deduplication queue earlier than g, the dynamic crawl operation may be performed on e first (if e is submitted later than g, it may be performed on g first). Likewise, if f is submitted to the second deduplication queue earlier than h, the static crawl operation may be performed on f first (if f is submitted later than h, it may be performed on h first). In other words, dynamic crawl operations are performed in the order in which dynamic URLs were submitted to the first deduplication queue, and static crawl operations in the order in which static URLs were submitted to the second deduplication queue.
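The example with page P can be traced with a short simulation. The sketch below is an assumption-laden illustration: the `SITE` dictionary encodes the example's link structure by hand instead of parsing or rendering anything, `hybrid_crawl` is a hypothetical name, and the two crawlers take turns in a single loop where the patent describes them running concurrently.

```python
from collections import deque

# page -> (static URLs found on it, dynamic URLs found on it), from the example.
SITE = {
    "P": (["c", "d"], ["a", "b"]),
    "a": (["f"], ["e"]),
    "c": (["h"], ["g"]),
}


def hybrid_crawl(target):
    first_queue, second_queue = deque(), deque()   # dynamic URLs / static URLs
    seen_dynamic, seen_static = set(), set()
    order = []                                     # (crawler, url) as crawled

    def submit(static_urls, dynamic_urls):
        # Submission with the duplicate check of paragraph [0065].
        for url in static_urls:
            if url not in seen_static:
                seen_static.add(url)
                second_queue.append(url)
        for url in dynamic_urls:
            if url not in seen_dynamic:
                seen_dynamic.add(url)
                first_queue.append(url)

    submit(*SITE.get(target, ([], [])))            # parse the target page
    while first_queue or second_queue:             # stop when both are drained
        if second_queue:                           # static crawler's turn
            url = second_queue.popleft()
            order.append(("static", url))
            submit(*SITE.get(url, ([], [])))
        if first_queue:                            # dynamic crawler's turn
            url = first_queue.popleft()
            order.append(("dynamic", url))
            submit(*SITE.get(url, ([], [])))
    return order


order = hybrid_crawl("P")
static_order = [u for kind, u in order if kind == "static"]
dynamic_order = [u for kind, u in order if kind == "dynamic"]
```

With this particular alternation c is crawled before a, so h reaches the second queue before f and g reaches the first queue before e; a different interleaving of the two crawlers would reverse those pairs, which is the point made in paragraph [0084].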

[0085] The present invention merges the crawl operations for dynamic URLs and static URLs into a hybrid crawl that combines the two. The dynamic URLs and static URLs in a page are crawled separately, and the URLs found by each crawl operation are again classified as dynamic or static, which preserves crawler efficiency and increases the processing capacity of the crawl. The dynamic crawler function and the static crawler function may be deployed together or in a distributed arrangement, which makes the crawl operation more flexible.

[0086] Although a dynamic crawler can extract both dynamic URLs and static URLs, it does so by simulating user operations, so it is difficult to deploy on its own in products that need high efficiency (deployed alone, it is generally as good as useless there). Of course, if a project or requirement places no demands on crawler performance, a dynamic crawler may also be deployed on its own.

[0087] A dynamic crawler is powerful, but because it requires IE rendering, its performance is low. High performance, by contrast, is the strength of the static crawler, so combining the two lets each contribute its strengths. The static crawler handles static links and the dynamic crawler handles dynamic links, which improves performance and strengthens functionality.

[0088] The two crawlers interact as follows: when the dynamic crawler finds a static URL, it hands it to the static crawler and processes the dynamic URLs it finds itself; when the static crawler finds a dynamic URL, it hands it to the dynamic crawler and processes the static URLs it finds itself.

[0089] In one specific application of the present invention, where a single task scans multiple websites, storing the URLs of all websites in just two tables, one for static URLs and one for dynamic URLs, could make queries slow and management inconvenient for the control center (or any other module that uses the crawler results). Each website is therefore assigned one subtask, and each time a subtask is executed, four tables are generated. Their names are randomly generated and recorded in the crawler control (controlofspider), static URL (urlofstaticspider), dynamic URL (urlofdynamicspider), and host list (hostlist) fields of the Tablelist table. The four tables are the crawler interaction control table, the static page table, the dynamic page table, and the hostlist table.

[0090] The crawler interaction control table controls whether the crawlers have finished. It has two fields: the static crawler completion flag (StaticSatus) and the dynamic crawler completion flag (DynamicSatus). When the static crawler has finished crawling the static page table, it sets StaticSatus = 0 and checks the value of DynamicSatus. If DynamicSatus = 0, the scan is complete; otherwise it sets StaticSatus = 1 and continues parsing URLs. When the dynamic crawler has finished crawling the dynamic page table, it sets DynamicSatus = 0 and checks the value of StaticSatus. If StaticSatus = 0, the scan is complete; otherwise it sets DynamicSatus = 1 and continues parsing URLs.
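The flag check described above can be written as one small function that either crawler calls when it has drained its page table. This is a sketch under assumptions: a plain dictionary stands in for the crawler interaction control table, and the function name `finish_pass` is hypothetical. It follows the stated steps literally and does not address concurrent access to the flags.

```python
def finish_pass(control, me, other):
    """Run the end-of-pass check for one crawler.

    control -- dict standing in for the crawler interaction control table
    me      -- this crawler's flag name, e.g. "StaticSatus"
    other   -- the other crawler's flag name, e.g. "DynamicSatus"
    Returns True when the whole scan is complete.
    """
    control[me] = 0            # this crawler has finished its page table
    if control[other] == 0:    # the other crawler has finished as well
        return True
    control[me] = 1            # not complete: go back to parsing URLs
    return False
```

The static crawler would call `finish_pass(control, "StaticSatus", "DynamicSatus")` and the dynamic crawler would call it with the two names swapped.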

[0091] The static page table stores the static URLs discovered by the dynamic crawler, which wait there for the static crawler to crawl them. After the static crawler has crawled every URL in its deduplication queue, it accesses the static page table and checks for records whose Status is 0. If there are any, it adds them to its deduplication queue and continues crawling. If there are none, the static crawler has finished for the time being. Whether the static crawler has truly crawled every static URL depends on whether new, uncrawled static URLs still appear in the static page table; if they do, crawling continues.

[0092] The dynamic page table stores the dynamic URLs discovered by the static crawler, which wait there for the dynamic crawler to crawl them. After the dynamic crawler has crawled every URL in its deduplication queue, it accesses the dynamic page table and checks for records whose Status is 0. If there are any, it adds them to its deduplication queue and continues crawling. If there are none, the dynamic crawler has finished for the time being. Whether the dynamic crawler has truly crawled every dynamic URL depends on whether new, uncrawled dynamic URLs still appear in the dynamic page table; if they do, crawling continues.
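The refill step that both page tables share can be sketched as follows. The sketch assumes a dictionary mapping each URL to its Status value in place of a database table, and it assumes that a record is marked with Status 1 once it has been picked up; the description only states that Status 0 means the record has not been crawled. The function name `refill_from_page_table` is hypothetical.

```python
def refill_from_page_table(page_table, queue, seen):
    """Move uncrawled records (Status == 0) into a crawler's queue.

    page_table -- dict mapping URL -> Status, standing in for the static or
                  dynamic page table (0 = not yet crawled)
    queue      -- list used as the crawler's deduplication queue
    seen       -- set of URLs already queued, used for deduplication
    Returns the number of URLs added; 0 means "finished for the time being".
    """
    added = 0
    for url, status in page_table.items():
        if status != 0:
            continue
        page_table[url] = 1      # assumed marker: the record has been taken
        if url not in seen:
            seen.add(url)
            queue.append(url)
            added += 1
    return added
```

A crawler would call this each time its queue runs empty. A return value of 0 only means that nothing is waiting at that moment, since the other crawler may still add records to the table later.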

[0093] It should be noted that the steps shown in the flowcharts of the drawings may be executed in a computer system, for example as a set of computer-executable instructions. In addition, those skilled in the art should understand that the modules or steps of the invention described above may be implemented with general-purpose computing devices. They may be concentrated on a single computing device or distributed over a network formed by multiple computing devices. Optionally, they may be implemented as program code executable by a computing device, so that they can be stored in a storage device and executed by the computing device. Alternatively, they may each be fabricated as an individual integrated circuit module, or several of the modules or steps may be fabricated as a single integrated circuit module. The invention is therefore not limited to any particular combination of hardware and software.

[0094] Although embodiments of the invention are disclosed above, they are described only to facilitate understanding of the invention and are not intended to limit it. Any person skilled in the art to which the invention pertains may make modifications and changes to the form and details of implementation without departing from the spirit and scope disclosed by the invention, but the scope of patent protection of the invention remains as defined by the appended claims.

Claims (10)

  1. A web crawler method, characterized by comprising: setting up a first deduplication queue; receiving a target page; crawling the target page with a static crawler; treating the uniform resource locators (URLs) in the target page that the static crawler cannot analyze as dynamic URLs; submitting the dynamic URLs to the first deduplication queue; and continuing to crawl the dynamic URLs in the first deduplication queue with a dynamic crawler.
  2. The method according to claim 1, characterized in that: when the first deduplication queue is set up, a second deduplication queue is further set up; when the target page is crawled with the static crawler, the static URLs in the target page are further obtained; the static URLs are further submitted to the second deduplication queue; and the static crawler is further used to crawl the static URLs in the second deduplication queue.
  3. The method according to claim 2, characterized in that: the step of continuing to crawl the dynamic URLs in the first deduplication queue with the dynamic crawler comprises submitting the dynamic URLs obtained to the first deduplication queue and submitting the static URLs obtained to the second deduplication queue; and the step of continuing to crawl the static URLs in the second deduplication queue with the static crawler comprises submitting the dynamic URLs obtained to the first deduplication queue and submitting the static URLs obtained to the second deduplication queue.
  4. The method according to claim 2, characterized in that the method further comprises: stopping the crawl when the dynamic URLs in the first deduplication queue and the static URLs in the second deduplication queue have all been crawled, or according to a stop condition.
  5. The method according to claim 2 or 4, characterized in that the step of setting up the first deduplication queue and the second deduplication queue comprises: setting up the first deduplication queue and the second deduplication queue by means of a database or an in-memory linked-list structure.
  6. A web crawler system, characterized by comprising: a setting module, configured to set up a first deduplication queue; a receiving module, configured to receive a target page; a static crawler module, configured to crawl the target page with a static crawler; a dynamic crawler module, configured to treat the uniform resource locators (URLs) in the target page that the static crawler cannot analyze as dynamic URLs, and further configured to continue crawling the dynamic URLs in the first deduplication queue with a dynamic crawler; and a submission module, configured to submit the dynamic URLs to the first deduplication queue.
  7. The system according to claim 6, characterized in that: the setting module is further configured to set up a second deduplication queue; the static crawler module is further configured to obtain the static URLs in the target page when crawling the target page with the static crawler, and to further crawl the static URLs in the second deduplication queue with the static crawler; and the submission module is further configured to submit the static URLs to the second deduplication queue.
  8. The system according to claim 7, characterized in that: the dynamic crawler module is configured to continue crawling the dynamic URLs in the first deduplication queue with the dynamic crawler, to submit the dynamic URLs obtained to the first deduplication queue and to submit the static URLs obtained to the second deduplication queue; and the static crawler module is configured to continue crawling the static URLs in the second deduplication queue with the static crawler, to submit the dynamic URLs obtained to the first deduplication queue and to submit the static URLs obtained to the second deduplication queue.
  9. The system according to claim 6, characterized in that the system further comprises: a stop module, configured to stop the crawl when the dynamic URLs in the first deduplication queue and the static URLs in the second deduplication queue have all been crawled, or according to a stop condition.
  10. The system according to claim 7 or 9, characterized in that: the setting module is configured to set up the first deduplication queue and the second deduplication queue by means of a database or an in-memory linked-list structure.

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN 201010189998 CN102262635A (en) 2010-05-25 2010-05-25 One kind of web crawler system and method

Publications (1)

Publication Number Publication Date
CN102262635A (en) 2011-11-30

Family

ID=45009265

Country Status (1)

Country Link
CN (1) CN102262635A (en)

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20040059708A1 (en) * 2002-09-24 2004-03-25 Google, Inc. Methods and apparatus for serving relevant advertisements
CN1851706A (en) * 2006-05-30 2006-10-25 南京大学 Body learning based intelligent subject-type network reptile system configuration method
CN101089856A (en) * 2007-07-20 2007-12-19 李沫南 Method for abstracting network data and web reptile system
CN101355587A (en) * 2008-09-17 2009-01-28 杭州华三通信技术有限公司 Method and apparatus for obtaining URL information as well as method and system for implementing searching engine
CN101515300A (en) * 2009-04-02 2009-08-26 阿里巴巴集团控股有限公司 Method and system for grabbing Ajax webpage content

Cited By (12)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102567513A (en) * 2011-12-27 2012-07-11 北京神州绿盟信息安全科技股份有限公司 Method and equipment for collecting phishing websites
CN102567513B (en) 2011-12-27 2014-09-17 北京神州绿盟信息安全科技股份有限公司 Method and equipment for collecting phishing websites
CN104714889A (en) * 2012-03-29 2015-06-17 北京奇虎科技有限公司 Test method and system for browser
CN102663058B (en) 2012-03-30 2013-12-18 华中科技大学 URL duplication removing method in distributed network crawler system
CN102663058A (en) * 2012-03-30 2012-09-12 华中科技大学 URL duplication removing method in distributed network crawler system
CN103455492A (en) * 2012-05-29 2013-12-18 腾讯科技(深圳)有限公司 Method and device for searching web pages
CN103685189A (en) * 2012-09-17 2014-03-26 百度在线网络技术(北京)有限公司 Website security evaluation method and system
CN102982162A (en) * 2012-12-05 2013-03-20 北京奇虎科技有限公司 System for acquiring webpage information
CN102982162B (en) * 2012-12-05 2016-04-13 北京奇虎科技有限公司 Get System Information page
WO2014105919A1 (en) * 2012-12-27 2014-07-03 Microsoft Corporation Identifying web pages in malware distribution networks
US9977900B2 (en) 2012-12-27 2018-05-22 Microsoft Technology Licensing, Llc Identifying web pages in malware distribution networks
CN103617198B (en) * 2013-11-14 2017-10-27 北京国双科技有限公司 Method and apparatus for merging the page

Legal Events

Date Code Title Description
C06 Publication
C10 Entry into substantive examination
C12 Rejection of a patent application after its publication