CN105550255A

CN105550255A - Resource searching and scheduling method and device

Info

Publication number: CN105550255A
Application number: CN201510901428.8A
Authority: CN
Inventors: 郑燕琴
Original assignee: Beijing Qihoo Technology Co Ltd; Qizhi Software Beijing Co Ltd
Current assignee: Beijing Qihoo Technology Co Ltd
Priority date: 2015-12-08
Filing date: 2015-12-08
Publication date: 2016-05-04
Anticipated expiration: 2035-12-08
Also published as: CN105550255B

Abstract

The present invention provides a resource search scheduling method and device. The method includes: obtaining the current main-body links of an index page to be scheduled; comparing the obtained current main-body links with the historical main-body links of the index page to be scheduled; if the comparison result indicates that links have been missed, performing page-turning scheduling on the index page until it is determined that no links are missed, and then executing the subsequent scheduling operation; and if the comparison result indicates that no links have been missed, executing the subsequent scheduling operation. The invention avoids missed links during resource search scheduling and improves the coverage of resource collection.

Description

Method and device for resource search and scheduling

Technical field

The present invention relates to the technical field of data search, and in particular to a resource search scheduling method and device.

Background

In network data search technology, the spider system sits at the very top of the search engine's data flow. It collects resources on the Internet to local storage for subsequent retrieval, and is one of the most important data sources of a search engine. The goal of the spider system is to discover and crawl every valuable web page on the Internet. To reach this goal, it must first discover the links to valuable web pages, and current spider systems use certain scheduling mechanisms to discover resource links as quickly and completely as possible.

For example, the following mechanisms can be set when scheduling resource links:

Mechanism 1: Schedule the mined seeds at a fixed period (for example, 20 times a day) so that all time-sensitive web pages are covered.

Mechanism 2: Given limited traffic and a large number of index pages, schedule ordinary index pages (those outside the seed set) at a fixed period (for example, recrawl once a week).

The above scheduling mechanisms have at least the following disadvantages:

For mechanism 1, when the seed scheduling interval is short, links are generally not missed, but traffic may be wasted: whenever a crawl is poorly timed, the traffic it uses is wasted. When the seed scheduling interval is long, links may be missed.

For mechanism 2, because the scheduling interval is long, links may be missed.

Missed links during scheduling reduce the collection coverage of the spider system.

Summary of the invention

In view of the above problems, the present invention is proposed to provide a resource search scheduling method and a corresponding resource search scheduling device that overcome the above problems or at least partially solve them. They avoid missed links during resource search scheduling and improve the coverage of resource collection.

The present invention provides a resource search scheduling method, including:

obtaining the current main-body links of an index page to be scheduled;

comparing the obtained current main-body links with the historical main-body links of the index page to be scheduled;

if it is determined from the comparison result that links have been missed, performing page-turning scheduling on the index page to be scheduled until it is determined that no links are missed, and then executing the subsequent scheduling operation;

if it is determined from the comparison result that no links have been missed, executing the subsequent scheduling operation.

In some optional embodiments, if it is determined from the comparison result that links have been missed, performing page-turning scheduling on the index page to be scheduled until it is determined that no links are missed specifically includes:

when the current main-body links and the historical main-body links have no intersection, obtaining the current main-body links of the next page of the index page to be scheduled, and returning to the step of comparing the obtained current main-body links with the historical main-body links of the index page to be scheduled, until the current main-body links and the historical main-body links intersect.

In some optional embodiments, obtaining the current main-body links of the index page to be scheduled specifically includes:

determining the largest similar block in the index page to be scheduled to obtain the current main-body links of the index page to be scheduled.

In some optional embodiments, determining the largest similar block in the index page to be scheduled specifically includes:

obtaining nodes whose Extensible Markup Language (XML) path (XPath) is the same, to obtain similar blocks;

determining the largest similar block in the index page to be scheduled according to the position and area of the similar blocks.

In some optional embodiments, the position of a similar block is determined from the width, height, top margin, and left margin of the similar block in the index page being scheduled;

the area of a similar block is determined from the width and height of the similar block in the index page being scheduled.

In some optional embodiments, determining the largest similar block in the index page to be scheduled according to the position and area of the similar blocks specifically includes:

determining the similar block that has the largest area and contains the center point of the page as the largest similar block.

In some optional embodiments, before performing page-turning scheduling on the index page to be scheduled, the method further includes:

obtaining the page-turning block in the index page to be scheduled, and determining the Uniform Resource Locator (URL) of the next-page link by regular-expression matching on the anchor of each link, so that the next page can be crawled according to the URL.

In some optional embodiments, obtaining the page-turning block in the index page to be scheduled specifically includes:

determining the page-turning block in the index page to be scheduled by regular-expression matching on the information contained in the anchors of the links, and obtaining the determined page-turning block.

In some optional embodiments, if the anchor of a link matched by the regular expression includes page-turning block feature information, the page-turning block in the index page to be scheduled is determined based on that page-turning block feature information.

In some optional embodiments, determining the Uniform Resource Locator (URL) of the next-page link by regular-expression matching on the anchor of each link specifically includes:

when the anchor of a link matched by the regular expression contains next-page link information, determining the link corresponding to that anchor as the URL of the next-page link.

In some optional embodiments, before comparing the obtained current main-body links with the historical main-body links of the index page to be scheduled, the method further includes:

extracting a time sequence, in order, from the publication time information included in each link of the current main-body links;

judging, from the publication time information of adjacent links, whether the extracted time sequence is in ascending or descending order;

if it is not, sorting the links of the current main-body links in chronological or reverse chronological order.
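The ordering check described above might be sketched as follows. This is a minimal illustration, not the patented implementation: the `(url, publish_time)` tuple layout, the function names, and the `newest_first` default are all assumptions introduced here, and the publication times are assumed to have been parsed already.

```python
from datetime import datetime

def is_monotonic(times):
    """Return True if adjacent times are all non-decreasing or all non-increasing."""
    pairs = list(zip(times, times[1:]))
    ascending = all(a <= b for a, b in pairs)
    descending = all(a >= b for a, b in pairs)
    return ascending or descending

def normalize_order(links, newest_first=True):
    """links: list of (url, publish_time) tuples (hypothetical layout).

    If the publication times are already ordered, the list is returned as is;
    otherwise it is sorted by publication time.
    """
    times = [published for _, published in links]
    if is_monotonic(times):
        return list(links)
    return sorted(links, key=lambda item: item[1], reverse=newest_first)
```

Sorting only when the sequence is not already ordered keeps the page's own order for well-behaved index pages, so the later comparison with the historical links sees the links in a predictable sequence.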

An embodiment of the present invention also provides a resource search scheduling device, including:

an acquisition module, configured to obtain the current main-body links of an index page to be scheduled;

a comparison module, configured to compare the obtained current main-body links with the historical main-body links of the index page to be scheduled;

an execution module, configured to: if it is determined from the comparison result that links have been missed, perform page-turning scheduling on the index page to be scheduled until it is determined that no links are missed, and then execute the subsequent scheduling operation; and if it is determined from the comparison result that no links have been missed, execute the subsequent scheduling operation.

In some optional embodiments, the execution module is specifically configured to:

when the current main-body links and the historical main-body links have no intersection, notify the acquisition module to obtain the current main-body links of the next page of the index page to be scheduled, whereupon the comparison module returns to the step of comparing the obtained current main-body links with the historical main-body links of the index page to be scheduled, until the current main-body links and the historical main-body links intersect, and then execute the subsequent scheduling operation.

In some optional embodiments, the acquisition module is specifically configured to:

determine the position of a similar block from the width, height, top margin, and left margin of the similar block in the index page being scheduled, and determine the area of the similar block from the width and height of the similar block in the index page being scheduled.

In some optional embodiments, the acquisition module is further configured to:

if the anchor of a link matched by the regular expression includes page-turning block feature information, determine the page-turning block in the index page to be scheduled based on that page-turning block feature information.

In some optional embodiments, the comparison module is further configured to:

With the resource search scheduling method and device of the present invention, before the scheduling operation is performed, the current main-body links of the index page to be scheduled are obtained and compared with the historical main-body links of that index page to determine whether links have been missed. The subsequent scheduling operation is executed only once it is determined that no links are missed. This avoids missed links during resource scheduling and improves the collection coverage of the scheduled resources. The method does not rely on shortening the scheduling period, so it does not increase traffic overhead, and it avoids missed links during resource scheduling while saving network traffic resources.

Furthermore, the above method of the present invention obtains the current main-body links of the index page being scheduled page by page before scheduling and compares them page by page, thereby avoiding missed links as far as possible.

The above description is only an overview of the technical solution of the present invention. So that the technical means of the present invention can be understood more clearly and implemented in accordance with the content of the specification, and so that the above and other objects, features, and advantages of the present invention become more apparent, specific embodiments of the present invention are set out below.

The above and other objects, advantages, and features of the present invention will become clearer to those skilled in the art from the following detailed description of specific embodiments of the present invention in conjunction with the accompanying drawings.

Brief description of the drawings

Various other advantages and benefits will become apparent to those of ordinary skill in the art upon reading the following detailed description of the preferred embodiments. The drawings are only for the purpose of illustrating the preferred embodiments and are not to be considered as limiting the invention. Throughout the drawings, the same reference numerals designate the same components. In the drawings:

Fig. 1 is a flowchart of the resource search scheduling method in Embodiment 1 of the present invention;

Fig. 2 is a screenshot of an index page in Embodiment 1 of the present invention;

Fig. 3 is a flowchart of the resource search scheduling method in Embodiment 2 of the present invention;

Fig. 4 is an example diagram of determining the largest similar block in an index page to be scheduled in Embodiment 2 of the present invention;

Fig. 5 is a flowchart of the resource search scheduling method in Embodiment 3 of the present invention;

Fig. 6 is a screenshot of another index page in Embodiment 3 of the present invention;

Fig. 7 is a flowchart of the resource search scheduling method in Embodiment 4 of the present invention;

Fig. 8 is a schematic structural diagram of the resource search scheduling device in an embodiment of the present invention.

Detailed description

Exemplary embodiments of the present disclosure will be described in more detail below with reference to the accompanying drawings. Although exemplary embodiments of the present disclosure are shown in the drawings, it should be understood that the present disclosure may be embodied in various forms and should not be limited by the embodiments set forth herein. Rather, these embodiments are provided so that the present disclosure can be understood more thoroughly and its scope can be fully conveyed to those skilled in the art.

To solve the problem of missed links during resource scheduling in the prior art, an embodiment of the present invention provides a resource search scheduling method that avoids missed links as far as possible without increasing traffic overhead, and improves the collection coverage of the scheduled resources. Specific embodiments are described in detail below.

Embodiment one:

The flow of the resource search scheduling method provided by Embodiment 1 of the present invention is shown in Figure 1 and includes the following steps:

Step S101: Obtain the current main-body links of the index page to be scheduled.

Each time an index page is scheduled, the index page is parsed, and the main-body links found in it are extracted and recorded. Specifically, for the index page to be scheduled, the current main-body links are obtained by determining the largest similar block in the index page.

Here, an index page is a web page whose main content is links rather than text. The main-body links are the set of links corresponding to the main content of the index page. For example, Figure 2 is a screenshot of an index page: the main-body links of the index page http://roll.news.sina.com.cn/news/gnxw/gdxw1/index.shtml are those inside the large box in Figure 2, which contains the link of each main item on the index page.

Step S102: Compare the obtained current main-body links with the historical main-body links of the index page to be scheduled.

After the current main-body links of the index page to be scheduled are obtained, the historical main-body links of that index page are obtained, and the two are compared to determine whether links have been missed.

Optionally, the historical main-body link set of the index page series can also be obtained in one pass at this point. An index page together with its series of subsequent pages is called an index page series, and the historical main-body link set of an index page series can include all of the main-body links extracted when the index page series was scheduled.

Step S103: Determine from the comparison result whether links have been missed. If so, execute step S104; if not, execute step S105.

Whether the index page to be scheduled has missed links can be determined from the result of comparing its current main-body links with its historical main-body links. For example, when the current main-body links of the index page to be scheduled are completely different from the historical main-body links, links were missed between the two scheduling runs.
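The comparison in step S103 can be illustrated with a small set-based check. The sketch below is only one possible reading: the function name is hypothetical, links are assumed to be plain URL strings, and treating an empty history (a page never scheduled before) as "nothing to compare against" is an assumption made here that the text does not address.

```python
def has_missing_links(current_links, history_links):
    """Return True when the current page shares no link with the history.

    current_links: iterable of URL strings found on the page just fetched.
    history_links: iterable of URL strings recorded in earlier scheduling runs.
    """
    history = set(history_links)
    if not history:
        # Assumption: with no history there is no basis for detecting a gap.
        return False
    return set(current_links).isdisjoint(history)
```

A disjoint result means every link on the page is new, so the point where the previous run left off has scrolled off the page and older unseen links may exist on later pages.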

Step S104: Perform page-turning scheduling on the index page to be scheduled until it is determined that no links are missed, then execute step S105.

If it is determined from the comparison result that links have been missed, page-turning scheduling is performed on the index page to be scheduled until it is determined that no links are missed, and the subsequent scheduling operation is then executed.

When it is determined that the index page to be scheduled has missed links, page-turning scheduling is performed on it: the next page is scheduled to find the missed links. If, after page turning, comparing the current main-body links of the next page with the historical main-body links shows that links are still missed, page-turning scheduling continues until it is determined that no links are missed.

Step S105: Execute the subsequent scheduling operation.

When it is determined from the comparison result that the index page to be scheduled has no missed links, the subsequent scheduling operation is executed. Missed links are thereby avoided. Compared with the prior-art practice of reducing missed links by shortening the scheduling period, this method avoids missed links more effectively without increasing network traffic usage, which saves system traffic resources.

Embodiment two:

The flow of the resource search scheduling method provided by Embodiment 2 of the present invention is shown in Figure 3 and includes the following steps:

Step S201: Obtain the current main-body links of the index page to be scheduled.

For the index page to be scheduled, the current main-body links are obtained by determining the largest similar block in the index page, where determining the largest similar block in the index page to be scheduled specifically includes:

obtaining nodes that have the same Extensible Markup Language (XML) path, XPath for short, to obtain similar blocks; and determining the largest similar block in the index page to be scheduled according to the position and area of the similar blocks.

Optionally, the position of a similar block is determined from the width, height, top margin, and left margin of the similar block in the index page being scheduled; the area of a similar block is determined from the width and height of the similar block in the index page being scheduled.

Optionally, determining the largest similar block in the index page to be scheduled according to the position and area of the similar blocks specifically includes: determining the similar block that has the largest area and contains the center point of the page as the largest similar block.

For example, a set of nodes with the same XPath (more than 4 nodes) is called a similar block here. The position information of each similar block is calculated, namely its width, height, top margin, and left margin in the page; these four values determine the position of the similar block in the web page. At the same time, the area of the similar block is calculated from its width and height. The similar block that has the largest area and contains the center point is called the largest similar block.
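The selection rule above might be sketched as follows. This is an illustrative sketch under several assumptions that are not in the source: the `Node` record, the idea that each node's XPath has had positional indices stripped so that sibling links share one path, the use of a bounding box over the grouped nodes as the block's geometry, and the availability of rendered layout coordinates in the first place.

```python
from collections import defaultdict
from dataclasses import dataclass

@dataclass
class Node:
    xpath: str     # assumed: XPath without positional indices
    left: float    # left margin in the rendered page
    top: float     # top margin in the rendered page
    width: float
    height: float

def bounding_box(nodes):
    """Smallest rectangle (left, top, width, height) covering all nodes."""
    left = min(n.left for n in nodes)
    top = min(n.top for n in nodes)
    right = max(n.left + n.width for n in nodes)
    bottom = max(n.top + n.height for n in nodes)
    return left, top, right - left, bottom - top

def largest_similar_block(nodes, page_width, page_height, min_nodes=5):
    """Return the nodes of the largest similar block, or None if there is none."""
    groups = defaultdict(list)
    for node in nodes:
        groups[node.xpath].append(node)

    center_x, center_y = page_width / 2, page_height / 2
    best, best_area = None, 0.0
    for group in groups.values():
        if len(group) < min_nodes:      # the text requires more than 4 nodes
            continue
        left, top, width, height = bounding_box(group)
        contains_center = (left <= center_x <= left + width
                           and top <= center_y <= top + height)
        area = width * height
        if contains_center and area > best_area:
            best, best_area = group, area
    return best
```

Requiring the block to contain the page center filters out wide navigation bars and footers, which can also consist of many links sharing one XPath.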

Figure 4 shows an example of determining the largest similar block in the index page to be scheduled. For the index page http://roll.news.sina.com.cn/news/gnxw/gdxw1/index.shtml, the three thick-lined boxes in Figure 4 mark three similar blocks, and the one marked by the bottom box is clearly the largest similar block.

The links in the largest similar block are the main-body links; that is, the links corresponding to the news items in the bottom box are the main-body links, for example:

After the sunken Yangtze River ship is righted, the next step is to lift it as a whole and drain it:

http://news.sina.com.cn/c/2015-06-05/133131917716.shtml;

Several Guangzhou hospitals pilot same-day discharge: surgery during the day, home at night:

http://news.sina.com.cn/c/2015-06-05/125531917706.shtml;

...

Step S202: Compare the obtained current main-body links with the historical main-body links of the index page to be scheduled.

Whether the index page to be scheduled has missed links is determined from the result of comparing its current main-body links with its historical main-body links.

Step S203: Judge whether the obtained current main-body links and the historical main-body links intersect. If so, execute step S205; if not, execute step S204.

Whether the current main-body links of the index page to be scheduled intersect with the historical main-body links shows whether the two have any links in common.

Step S204: Obtain the current main-body links of the next page of the index page to be scheduled, and return to step S202.

When the obtained current main-body links and the historical main-body links have no intersection, the current main-body links of the next page of the index page to be scheduled are obtained, and the process returns to the step of comparing the obtained current main-body links with the historical main-body links of the index page to be scheduled, until the current main-body links and the historical main-body links intersect.

As described in steps S202 to S204, each time the index page to be scheduled is scheduled, the currently discovered main-body links are compared with the main-body links discovered in the previous scheduling. If there is no intersection, that is, the main-body links discovered in the previous scheduling are completely different from those currently discovered, links were missed between the two scheduling runs. The next page (page 2) must then be scheduled to find the missed links, and the currently discovered main-body links are added to the historical main-body link set of the index page series. When the next page is scheduled, the currently discovered main-body links are likewise compared with the historical main-body link set of the index page series. If there is no intersection, links are still missing, and page turning continues to the next page (page 3). This continues until the currently discovered main-body links intersect with the historical main-body link set of the index page series, which indicates that all missed links have been found and no further page-turning scheduling is needed.
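The loop formed by steps S202 to S204 could look roughly like the sketch below. The callables `fetch_body_links` and `find_next_url`, the in-memory `history` set, and the `max_pages` safety cap are all hypothetical names and choices introduced for illustration; the source does not specify how pages are fetched, how the history is stored, or whether page turning is bounded.

```python
def schedule_index_page(first_url, fetch_body_links, find_next_url,
                        history, max_pages=20):
    """Page forward until a fetched page overlaps the series history.

    first_url:        URL of the index page (page 1 of the series).
    fetch_body_links: callable(url) -> list of main-body link URLs.
    find_next_url:    callable(url) -> next page URL, or None if there is none.
    history:          set of link URLs seen in earlier runs; updated in place.
    Returns the links that were not in the history, in discovery order.
    """
    new_links = []
    url = first_url
    pages_fetched = 0
    while url is not None and pages_fetched < max_pages:
        current = fetch_body_links(url)
        pages_fetched += 1
        overlaps = any(link in history for link in current)
        new_links.extend(link for link in current if link not in history)
        # As in the text, links found on this page join the series history.
        history.update(current)
        if overlaps:
            break                      # gap closed: no more page turning
        url = find_next_url(url)
    return new_links
```

When the first page already overlaps the history, only one page is fetched, which is why the approach adds traffic only in the runs where links would otherwise have been missed.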

Step S205: Execute the subsequent scheduling operation.

When it is determined from the comparison result that the index page to be scheduled has no missed links, the subsequent scheduling operation is executed.

Embodiment three:

The flow of the resource search scheduling method provided by Embodiment 3 of the present invention is shown in Figure 5 and includes the following steps:

Step S301: Obtain the current main-body links of the index page to be scheduled.

Step S302: Compare the obtained current main-body links with the historical main-body links of the index page to be scheduled.

Step S303: Judge whether the obtained current main-body links and the historical main-body links intersect. If so, execute step S307; if not, execute step S304.

Step S304: Obtain the page-turning block in the index page to be scheduled.

In this embodiment, building on Embodiment 1 and Embodiment 2, the page-turning block in the index page to be scheduled is obtained before the current main-body links of the next page are obtained, to facilitate page-turning scheduling.

Obtaining the page-turning block in the index page to be scheduled specifically includes: determining the page-turning block in the index page to be scheduled by regular-expression matching on the information contained in the anchors of the links, and obtaining the determined page-turning block.

If the anchor of a link matched by the regular expression includes page-turning block feature information, the page-turning block in the index page to be scheduled is determined based on that page-turning block feature information.

For example, the long box at the top of Figure 2 is a page-turning block. The long box in the middle of Figure 4 is also a page-turning block. The page-turning block shown in Figure 4 can also be extracted from the similar blocks of the web page http://roll.news.sina.com.cn/news/gnxw/gdxw1/index.shtml.

Optionally, whether a block is a page-turning block can be judged by regular-expression matching on the anchors of its links. For example, a block can be determined to be a page-turning block when it contains at least one of the following items of page-turning block feature information: a number, "<", ">", "<<", ">>", "上一页" (previous page), "下一页" (next page), "第一页" (first page), "最后一页" (last page), and similar keywords.
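The anchor-based test for a page-turning block might be sketched as follows. The pattern covers only the keywords listed above, and two choices are assumptions made here rather than taken from the source: an anchor must consist entirely of a feature keyword (to avoid matching headlines that merely contain a digit), and among candidate blocks the one with the most matching anchors is chosen.

```python
import re

# Feature keywords from the text; whole-anchor matching is an assumption.
PAGER_PATTERN = re.compile(
    r"^\s*(\d+|<|>|<<|>>|上一页|下一页|第一页|最后一页)\s*$"
)

def is_pager_anchor(text):
    """True if the anchor text looks like a page-turning control."""
    return PAGER_PATTERN.match(text) is not None

def find_pager_block(blocks):
    """blocks: list of similar blocks, each a list of anchor texts.

    Returns the block with the most pager-like anchors, or None if no
    block contains any.
    """
    best, best_hits = None, 0
    for block in blocks:
        hits = sum(1 for text in block if is_pager_anchor(text))
        if hits > best_hits:
            best, best_hits = block, hits
    return best
```

Because similar blocks have already been grouped by XPath, the candidate list passed to `find_pager_block` can simply be the anchor texts of each similar block found on the page.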

Step S305: Determine the uniform resource locator (URL) of the next-page link by regex-matching the anchor text of the links, so that page-turning crawling can be performed according to the determined URL.

When the anchor text of a link matches the next-page link information, the link corresponding to that anchor is determined as the URL of the next-page link, so that page-turning crawling can be performed according to the determined URL.

When determining the next-page link, the anchor text of each node in the page-turning block is regex-matched against the next-page link information, which includes at least one of the following: a specified number, "下一页" (next page), "后一页" (following page), ">", and similar keywords. If a match is found, the link corresponding to that anchor is recorded as the URL of the next-page link; otherwise, the next page number is calculated from the current page number, and the URL of the next-page link is then assembled from it.

As shown in FIG. 4, the anchor text of a node in the page-turning block matches the keyword "下一页" (next page), so the next-page link is the link whose anchor is "下一页": http://roll.news.sina.com.cn/news/gnxw/gdxw1/index_2.shtml.

FIG. 6 is a partial screenshot of a web page that includes an example of a page-turning block, marked by the lower rectangular box in FIG. 6. When the anchor text of the nodes in this page-turning block is regex-matched against the next-page link information, none of "下一页", "后一页", ">" and the like match, so the next page number is calculated instead (the current page is the index page, i.e. page 1, so the next page number is 2). The link whose anchor matches the number 2 is then found in the page-turning block; here it is http://gold.jrj.com.cn/list/hjzx-2.shtml.
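The two-stage lookup in step S305 (keyword match first, page-number fallback second) might look like the sketch below. It follows the FIG. 6 example, where the fallback picks the link whose anchor equals the computed page number. The function name, the `(anchor_text, href)` input shape and the exact keyword pattern are assumptions made for illustration.

```python
import re
from urllib.parse import urljoin

# Assumed keyword pattern for "next page" anchors, based on the listed keywords.
NEXT_ANCHOR = re.compile(r"^\s*(下一页|后一页|next page|>)\s*$", re.IGNORECASE)


def find_next_page_url(paging_links, base_url, current_page=1):
    """Return the absolute URL of the next page, or None if none is found.

    `paging_links` is a list of (anchor_text, href) pairs taken from the
    page-turning block (a hypothetical input format).
    """
    # Stage 1: an anchor that explicitly says "next page".
    for text, href in paging_links:
        if NEXT_ANCHOR.match(text):
            return urljoin(base_url, href)

    # Stage 2: compute the next page number and look for an anchor equal to it.
    next_page = str(current_page + 1)
    for text, href in paging_links:
        if text.strip() == next_page:
            return urljoin(base_url, href)

    return None
```

`urljoin` resolves relative links against the index page URL, so both relative and absolute `href` values are handled.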

Step S306: Acquire the current main body links of the next page of the index page to be scheduled, and return to step S302.

Step S307: Perform subsequent scheduling operations.

For steps not described in detail above, refer to the descriptions of the corresponding steps in Embodiment 1 and Embodiment 2.

Embodiment 4:

The flow of the resource search scheduling method provided in Embodiment 4 of the present invention is shown in FIG. 7 and includes the following steps:

Step S401: Acquire the current main body links of the index page to be scheduled.

Step S402: Extract a time series, in order, from the publication time information of each link in the acquired current main body links.

In this embodiment, before the acquired current main body links are compared with the historical main body links of the index page to be scheduled, it is determined whether the links are arranged in chronological or reverse chronological order. If they are not, they are sorted, which simplifies the subsequent comparison of main body links.

As shown in FIG. 4, the underlined items in the large lower box are the publication times of the main body links. Each main body link has corresponding publication time information, from which the time series can be extracted in order.

Step S403: According to the publication time information of adjacent links, judge whether the extracted time series is in chronological or reverse chronological order. If so, perform step S405; if not, perform step S404.

Whether the main body links are sorted in chronological or reverse chronological order is judged by checking, from the publication times of adjacent links, whether the extracted time series is ordered or reverse-ordered.

Step S404: Sort the links in the current main body links in chronological or reverse chronological order.

When the extracted time series is neither chronological nor reverse chronological, the main body links are not sorted by time. To simplify the comparison of main body links, they can be sorted in chronological or reverse chronological order.
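Steps S402 to S404 can be illustrated with the short sketch below. It assumes each main body link is available as a `(url, publish_time)` pair with an already parsed `datetime`; the function names and the `newest_first` default are hypothetical choices, since the text allows either sort direction.

```python
from datetime import datetime


def is_time_ordered(times):
    """True if adjacent timestamps are all non-decreasing or all non-increasing."""
    pairs = list(zip(times, times[1:]))
    ascending = all(a <= b for a, b in pairs)
    descending = all(a >= b for a, b in pairs)
    return ascending or descending


def normalize_links(links, newest_first=True):
    """Return the links unchanged if already time-ordered, otherwise sorted.

    `links` is a list of (url, publish_time) pairs (assumed input format).
    """
    times = [publish_time for _, publish_time in links]
    if is_time_ordered(times):
        return list(links)  # S403: already ordered, go straight to comparison
    # S404: impose an order so that later comparison is straightforward
    return sorted(links, key=lambda item: item[1], reverse=newest_first)
```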

Step S405: Compare the acquired current main body links with the historical main body links of the index page to be scheduled.

Step S406: Judge whether the acquired current main body links and the historical main body links intersect. If so, perform step S408; if not, perform step S407.

Step S407: Acquire the current main body links of the next page of the index page to be scheduled, and return to step S402.

For index pages whose main body links are sorted in chronological or reverse chronological order, each time the index page is scheduled, the currently discovered main body links are compared with the set of historical main body links. If the two do not intersect, that is, the main body links found in the previous scheduling are entirely different from those found now, links have been missed between the two schedulings. Page 2 must then be scheduled to find the missed links, and the currently discovered main body links are added to the historical main body link set of this index page series. When page 2 is scheduled, the currently discovered main body links are likewise compared with the historical main body link set of the index page series; if there is still no intersection, links are still being missed, and scheduling of page 3 is initiated. This continues until the currently discovered main body links intersect the historical main body link set of the index page series, which indicates that all missed links have been found and no further page-turning scheduling is needed.
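The page-turning loop of steps S405 to S407 can be sketched as below. This is an illustrative reading of the description, not the patented implementation. `fetch_main_links` is a hypothetical callback that returns the main body links of a given page number, and the `max_pages` cap and the stop on an empty page are safety assumptions that the text does not mention.

```python
def schedule_index_page(fetch_main_links, history, max_pages=50):
    """Page through an index page series until the links overlap with history.

    fetch_main_links(page_no) -> iterable of URLs (hypothetical callback).
    history: set of previously seen main body links, updated in place.
    Returns the number of pages that were scheduled.
    """
    page_no = 1
    while page_no <= max_pages:
        current = set(fetch_main_links(page_no))
        overlaps = bool(current & history)  # S406: is there an intersection?
        history |= current                  # remember newly discovered links
        if overlaps or not current:
            # Intersection found: no more missed links. An empty page also
            # stops the loop (assumption, to avoid paging forever).
            return page_no
        page_no += 1                        # S407: links were missed, turn the page
    return max_pages
```

With an empty history (a first-ever scheduling) the loop pages until the cap or an empty page is reached, which is why a bound of some kind seems advisable in practice.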

For index pages whose main body links are sorted in chronological order (oldest first), the newest main body links are generally of most interest, so scheduling can start from the last page and move toward the front; the mechanism is otherwise the same.

Step S408: Perform subsequent scheduling operations.

For steps not described in detail above, refer to the descriptions of the corresponding steps in Embodiment 1, Embodiment 2 and Embodiment 3.

The optional steps that Embodiment 3 and Embodiment 4 add to Embodiment 1 and Embodiment 2 can be added separately, as described in Embodiment 3 and Embodiment 4 respectively, or the steps added in Embodiment 3 and Embodiment 4 can both be added on top of Embodiment 1 and Embodiment 2. The latter combination is not elaborated as a separate embodiment.

Based on the same inventive concept, an embodiment of the present invention further provides a resource search scheduling device. Its structure is shown in FIG. 8 and includes an acquisition module 801, a comparison module 802 and an execution module 803.

The acquisition module 801 is configured to acquire the current main body links of the index page to be scheduled.

The comparison module 802 is configured to compare the acquired current main body links with the historical main body links of the index page to be scheduled.

The execution module 803 is configured to: if it is determined from the comparison result that links have been missed, perform page-turning scheduling on the index page to be scheduled until it is determined that no links have been missed, and then perform subsequent scheduling operations; and if it is determined from the comparison result that no links have been missed, perform subsequent scheduling operations.

Optionally, the execution module 803 is specifically configured to: when the acquired current main body links and the historical main body links do not intersect, notify the acquisition module 801 to acquire the current main body links of the next page of the index page to be scheduled, whereupon the comparison module 802 again compares the acquired current main body links with the historical main body links of the index page to be scheduled; and, once the acquired current main body links and the historical main body links intersect, perform subsequent scheduling operations.

Optionally, the acquisition module 801 is specifically configured to determine the largest similar block in the index page to be scheduled, thereby obtaining the current main body links of the index page to be scheduled.

Optionally, the acquisition module 801 is specifically configured to acquire nodes having the same Extensible Markup Language (XML) path (XPath) to obtain similar blocks, and to determine the largest similar block in the index page to be scheduled according to the positions and areas of the similar blocks.

Optionally, the acquisition module 801 is specifically configured to determine the position of a similar block according to its width, height, top margin and left margin in the scheduled index page, and to determine the area of the similar block according to its width and height in the scheduled index page.

Optionally, the acquisition module 801 is specifically configured to determine, among the similar blocks, the similar block that has the largest area and contains the center point of the page as the largest similar block.
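The selection rule for the largest similar block (largest area among blocks that contain the page center) can be sketched as follows. The `Block` record and its field names are hypothetical; in practice the geometry would come from a rendering engine, which the text does not specify.

```python
from dataclasses import dataclass


@dataclass
class Block:
    """A similar block: nodes sharing one XPath, with its rendered geometry."""
    xpath: str
    left: float    # left margin of the block in the page
    top: float     # top margin of the block in the page
    width: float
    height: float

    @property
    def area(self):
        return self.width * self.height

    def contains(self, x, y):
        return (self.left <= x <= self.left + self.width
                and self.top <= y <= self.top + self.height)


def largest_similar_block(blocks, page_width, page_height):
    """Return the largest-area block that contains the page center, or None."""
    cx, cy = page_width / 2, page_height / 2
    candidates = [b for b in blocks if b.contains(cx, cy)]
    return max(candidates, key=lambda b: b.area, default=None)
```

Filtering on the center point first keeps wide navigation bars or sidebars from being picked merely because they are large.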

Optionally, the acquisition module 801 is further configured to acquire the page-turning block in the index page to be scheduled and to determine the uniform resource locator (URL) of the next-page link by regex-matching the anchor text of the links, so that page-turning crawling can be performed according to the URL.

Optionally, the acquisition module 801 is specifically configured to determine the page-turning block in the index page to be scheduled by regex-matching the information contained in the anchor text of the links, and to acquire the determined page-turning block.

Optionally, the acquisition module 801 is specifically configured to determine the page-turning block in the index page to be scheduled based on the page-turning block feature information if the anchor text of a link matches that feature information under regular-expression matching.

Optionally, the acquisition module 801 is specifically configured to determine, when the anchor text of a link matches the next-page link information, the link corresponding to that anchor as the URL of the next-page link.

Optionally, the comparison module 802 is further configured to: extract a time series, in order, from the publication time information of each link in the current main body links; judge, according to the publication time information of adjacent links, whether the extracted time series is in chronological or reverse chronological order; and, if it is not, sort the links in the acquired current main body links in chronological or reverse chronological order.

The resource search scheduling method and device provided by the embodiments of the present invention constitute a scheduling mechanism for specific index pages. Scheduling is driven by missed-link detection, which improves the collection coverage of resource search at the most economical traffic cost.

In the description provided herein, numerous specific details are set forth. However, it is understood that embodiments of the invention may be practiced without these specific details. In some instances, well-known methods, structures and techniques have not been shown in detail in order not to obscure the understanding of this description.

Similarly, it should be appreciated that in the foregoing description of exemplary embodiments of the invention, in order to streamline this disclosure and to aid understanding of one or more of the various inventive aspects, various features of the invention are sometimes grouped together in a single embodiment, figure, or description thereof. This method of disclosure, however, is not to be interpreted as reflecting an intention that the claimed invention requires more features than are expressly recited in each claim. Rather, as the following claims reflect, inventive aspects lie in less than all features of a single foregoing disclosed embodiment. Thus, the claims following the Detailed Description are hereby expressly incorporated into this Detailed Description, with each claim standing on its own as a separate embodiment of this invention.

Those skilled in the art can understand that the modules in the device of an embodiment can be adaptively changed and arranged in one or more devices different from that embodiment. Modules, units or components in the embodiments may be combined into one module, unit or component, and may furthermore be divided into a plurality of sub-modules, sub-units or sub-components. Except where at least some of such features and/or processes or units are mutually exclusive, all features disclosed in this specification (including the accompanying claims, abstract and drawings), and all processes or units of any method or device so disclosed, may be combined in any combination. Unless expressly stated otherwise, each feature disclosed in this specification (including the accompanying claims, abstract and drawings) may be replaced by an alternative feature serving the same, equivalent or similar purpose.

Furthermore, those skilled in the art will understand that although some embodiments described herein include some features included in other embodiments but not others, combinations of features from different embodiments are meant to be within the scope of the invention and to form different embodiments. For example, in the claims, any one of the claimed embodiments can be used in any combination.

The various component embodiments of the present invention may be implemented in hardware, in software modules running on one or more processors, or in a combination thereof. Those skilled in the art should understand that a microprocessor or a digital signal processor (DSP) may be used in practice to implement some or all functions of some or all components of the resource search scheduling device according to the embodiments of the present invention. The present invention can also be implemented as a device or apparatus program (for example, a computer program and a computer program product) for performing part or all of the methods described herein. Such a program implementing the present invention may be stored on a computer-readable medium, or may take the form of one or more signals. Such a signal may be downloaded from an Internet site, provided on a carrier signal, or provided in any other form.

It should be noted that the above-mentioned embodiments illustrate rather than limit the invention, and that those skilled in the art will be able to design alternative embodiments without departing from the scope of the appended claims. In the claims, any reference signs placed between parentheses shall not be construed as limiting the claim. The word "comprising" does not exclude the presence of elements or steps not listed in a claim. The word "a" or "an" preceding an element does not exclude the presence of a plurality of such elements. The invention can be implemented by means of hardware comprising several distinct elements, and by means of a suitably programmed computer. In a unit claim enumerating several means, several of these means can be embodied by one and the same item of hardware. The use of the words first, second, third and so on does not indicate any order; these words can be interpreted as names.

So far, those skilled in the art should appreciate that, although a number of exemplary embodiments of the present invention have been shown and described in detail herein, many other variations or modifications consistent with the principles of the invention can still be directly determined or derived from the disclosure without departing from the spirit and scope of the present invention. Accordingly, the scope of the present invention should be understood and deemed to cover all such other variations or modifications.

According to one aspect of the present invention, the following is also disclosed: A1. A resource search scheduling method, including:

A2. The method according to A1, wherein, if it is determined from the comparison result that links have been missed, performing page-turning scheduling on the index page to be scheduled until it is determined that no links have been missed specifically includes:

A3. The method according to A1, wherein acquiring the current main body links of the index page to be scheduled specifically includes:

A4. The method according to A3, wherein determining the largest similar block in the index page to be scheduled specifically includes:

A5. The method according to A4, wherein the position of the similar block is determined according to the width, height, top margin and left margin of the similar block in the scheduled index page;

A6. The method according to A4, wherein determining the largest similar block in the index page to be scheduled according to the positions and areas of the similar blocks specifically includes:

A7. The method according to A1, wherein, before the page-turning scheduling is performed on the index page to be scheduled, the method further includes:

A8. The method according to A7, wherein acquiring the page-turning block in the index page to be scheduled specifically includes:

A9. The method according to A8, wherein, if the anchor text of a link matches page-turning block feature information under regular-expression matching, the page-turning block in the index page to be scheduled is determined based on the page-turning block feature information.

A10. The method according to A7, wherein determining the uniform resource locator (URL) of the next-page link by regex-matching the anchor text of the links specifically includes:

A11. The method according to any one of A1-A10, wherein, before the acquired current main body links are compared with the historical main body links of the index page to be scheduled, the method further includes:

According to another aspect of the present invention, the following is also disclosed: B12. A resource search scheduling device, including:

B13. The device according to B12, wherein the execution module is specifically configured to:

B14. The device according to B12, wherein the acquisition module is specifically configured to:

B15. The device according to B14, wherein the acquisition module is specifically configured to:

B16. The device according to B15, wherein the acquisition module is specifically configured to:

B17. The device according to B15, wherein the acquisition module is specifically configured to:

B18. The device according to B12, wherein the acquisition module is further configured to:

B19. The device according to B18, wherein the acquisition module is specifically configured to:

B20. The device according to B19, wherein the acquisition module is specifically configured to:

B21. The device according to B18, wherein the acquisition module is specifically configured to:

B22. The device according to any one of B12-B21, wherein the comparison module is further configured to:

Claims

1. A resource search scheduling method, comprising:

acquiring current main body links of an index page to be scheduled;

comparing the acquired current main body links with historical main body links of the index page to be scheduled;

if it is determined from the comparison result that links have been missed, performing page-turning scheduling on the index page to be scheduled until it is determined that no links have been missed, and then performing subsequent scheduling operations;

if it is determined from the comparison result that no links have been missed, performing subsequent scheduling operations.

2. The method according to claim 1, wherein, if it is determined from the comparison result that links have been missed, performing page-turning scheduling on the index page to be scheduled until it is determined that no links have been missed specifically comprises:

when the current main body links and the historical main body links do not intersect, acquiring the current main body links of the next page of the index page to be scheduled, and returning to the step of comparing the acquired current main body links with the historical main body links of the index page to be scheduled, until the current main body links and the historical main body links intersect.

3. The method according to claim 1 or 2, wherein acquiring the current main body links of the index page to be scheduled specifically comprises:

determining the largest similar block in the index page to be scheduled to obtain the current main body links of the index page to be scheduled.

4. The method according to any one of claims 1-3, wherein determining the largest similar block in the index page to be scheduled specifically comprises:

acquiring nodes having the same Extensible Markup Language (XML) path (XPath) to obtain similar blocks;

determining the largest similar block in the index page to be scheduled according to the positions and areas of the similar blocks.

5. The method according to any one of claims 1-4, wherein the position of a similar block is determined according to the width, height, top margin and left margin of the similar block in the scheduled index page;

the area of the similar block is determined according to the width and height of the similar block in the scheduled index page.

6. The method according to any one of claims 1-5, wherein determining the largest similar block in the index page to be scheduled according to the positions and areas of the similar blocks specifically comprises:

determining, among the similar blocks, the similar block that has the largest area and contains the center point of the page as the largest similar block.

7. The method according to any one of claims 1-6, wherein, before the page-turning scheduling is performed on the index page to be scheduled, the method further comprises:

acquiring the page-turning block in the index page to be scheduled, and determining the uniform resource locator (URL) of the next-page link by regex-matching the anchor text of the links, so that page-turning crawling can be performed according to the URL.

8. The method according to any one of claims 1-7, wherein acquiring the page-turning block in the index page to be scheduled specifically comprises:

determining the page-turning block in the index page to be scheduled by regex-matching the information contained in the anchor text of the links, and acquiring the determined page-turning block.

9. The method according to any one of claims 1-8, wherein, if the anchor text of a link matches page-turning block feature information under regular-expression matching, the page-turning block in the index page to be scheduled is determined based on the page-turning block feature information.

10. A resource search scheduling device, comprising:

an acquisition module, configured to acquire current main body links of an index page to be scheduled;

a comparison module, configured to compare the acquired current main body links with historical main body links of the index page to be scheduled;

an execution module, configured to: if it is determined from the comparison result that links have been missed, perform page-turning scheduling on the index page to be scheduled until it is determined that no links have been missed, and then perform subsequent scheduling operations; and if it is determined from the comparison result that no links have been missed, perform subsequent scheduling operations.