CN103475688A

CN103475688A - Distributed method and distributed system for downloading website data

Info

Publication number: CN103475688A
Application number: CN2013101986884A
Authority: CN
Inventors: 吴自立
Original assignee: Beijing Netqin Technology Co Ltd
Current assignee: Beijing Netqin Technology Co Ltd
Priority date: 2013-05-24
Filing date: 2013-05-24
Publication date: 2013-12-25

Abstract

The invention provides a distributed method and a distributed system for downloading website data. The system comprises a controller, a database and multiple downloading modules. The controller reads one or more addresses in an initial address list, and stores the one or more addresses into at least one downloading queue in multiple downloading queues maintained in the database. The database is used for maintaining the multiple downloading queues which at least comprise a first downloading queue and a second downloading queue, wherein the downloading queues correspond to data of different types. The multiple downloading modules are used for respectively downloading data of multiple types according to the multiple downloading queues, parsing a to-be-downloaded subsequent address according to the data downloaded by the first downloading queue, determining which downloading queue the subsequent address belongs to, and according to results of the determination, storing the subsequent address into the corresponding downloading queue. The above process is repeatedly conducted until preset conditions are satisfied.

Description

Distributed method and system for downloading website data

Technical field

The present invention relates generally to distributed data downloading, and more specifically to a distributed method and system for downloading website data.

Background technology

In the current search engine field, in order to search web pages on the Internet quickly and efficiently, a search engine provider usually needs to capture/crawl/download (hereinafter, these terms are used interchangeably as synonyms) various web pages, applications, audio, video and so on from each website in advance and analyze them. To capture web pages in this way, various web page capture programs (also called web crawlers, web spiders, etc.; hereinafter collectively referred to as web crawlers) are inevitably used.

An existing web crawler generally starts downloading web pages from some initial addresses, and then decides which pages to crawl next by parsing the content of the downloaded pages. The crawler program usually needs to implement a crawl queue for storing the initial addresses and the subsequent crawl addresses that have been parsed out. In addition, it needs to implement a deduplication filter for storing the addresses of downloaded pages, so as to avoid repeated crawling.

Compared with a stand-alone crawler program, a distributed web crawler that can be deployed on multiple nodes simultaneously is more flexible in its use of computer hardware resources and/or network bandwidth and in handling issues related to IP addresses and regions. Existing distributed web crawlers are usually implemented by sharing the crawl queue and the deduplication filter.

However, existing crawler systems have at least the following problems: (1) they do not support downloading different types of data with different download configurations; (2) their deduplication filters do not support a validity period, that is, they do not support repeatedly crawling web pages that may be updated; (3) they do not support configurable crawl rules or the analysis of JavaScript (hereinafter JS).

In addition, existing distributed crawler systems have at least the following problems. (1) They are difficult to deploy and configure uniformly. Because a distributed crawler system is usually deployed on many computers, centrally managing and uniformly configuring each crawler node (module) is very difficult. In particular, when multiple websites need to be crawled, starting or stopping the crawl of a website usually requires logging in to the crawler servers, modifying configurations, starting/stopping the associated processes, and so on. When the number of websites is large, these procedures become very tedious and error-prone. (2) It is difficult to set an overall crawl interval for a specific website. The access interval of web crawlers to a website usually needs to be controlled so as to avoid affecting the normal operation of the website and to reduce the burden placed on it. However, the prior art can only limit the crawl interval within a single crawler program and cannot control all distributed crawler nodes as a whole. In other words, although the access frequency of each individual crawler to a given website can be limited, it is difficult to avoid many crawlers accessing that website simultaneously or almost simultaneously. In that case, the normal operation of the website is disturbed by bursts of peak access, which in serious cases may bring the website down or take it offline. (3) It is difficult to collect statistics. The overall statistics of the distributed crawler cannot be obtained directly, and a separate logging system usually has to be built to provide this function.

Summary of the invention

Therefore, the present invention provides a distributed method and system for downloading website data that solve at least the technical problems described above.

Specifically, according to a first aspect of the present invention, a distributed method for downloading website data is provided. The method comprises the following steps: a) reading one or more addresses in an initial address list and storing the one or more addresses into at least one of a plurality of download queues maintained in a database, wherein the plurality of download queues comprise at least a first download queue and a second download queue, and each download queue corresponds to a different type of data; b) downloading data of a plurality of types according to the plurality of download queues, respectively; c) parsing subsequent addresses to be downloaded from the data downloaded according to the first download queue; d) determining which download queue each subsequent address belongs to; e) storing the subsequent address into the corresponding download queue according to the result of step d); and f) repeating steps b) to e) until a predetermined condition is met.
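Steps a) to f) can be pictured with the following minimal sketch. It is only an illustration under stated assumptions: two in-memory queues stand in for the download queues kept in the database, and the names `crawl`, `fetch`, `parse_links`, `classify` and `max_rounds` are hypothetical and do not come from the source. Deduplication, priorities and download intervals are deliberately left out, so `max_rounds` acts as a simple stand-in for the predetermined condition.

```python
from collections import deque

def crawl(start_addresses, fetch, parse_links, classify, max_rounds=1000):
    """Sketch of steps a) to f) using two in-memory queues (assumption)."""
    queues = {"page": deque(), "app": deque()}
    # a) seed the download queues from the initial address list
    for addr in start_addresses:
        queues[classify(addr)].append(addr)
    downloaded = []
    rounds = 0
    # f) repeat until every queue is empty or the round limit is reached
    while any(queues.values()) and rounds < max_rounds:
        rounds += 1
        for kind in ("page", "app"):
            if not queues[kind]:
                continue
            addr = queues[kind].popleft()
            data = fetch(addr)                      # b) download the data
            downloaded.append((kind, addr))
            if kind == "page":
                for nxt in parse_links(data):       # c) parse subsequent addresses
                    # d) decide the queue, e) store the address in it
                    queues[classify(nxt)].append(nxt)
    return downloaded

# Hypothetical toy "website": each page maps to the links found on it.
site = {"/": ["/list", "/a.apk"], "/list": ["/b.apk"]}
result = crawl(
    ["/"],
    fetch=lambda addr: site.get(addr, []),
    parse_links=lambda data: data,
    classify=lambda addr: "app" if addr.endswith(".apk") else "page",
)
```

In the toy run, pages are parsed for further links while application addresses are only downloaded, which mirrors the split between the first and second download queues.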

In some embodiments, step b) comprises: downloading data from the corresponding website according to the records in the corresponding download queue, subject to a per-website download interval that is centrally maintained in the database.

In some embodiments, the first download queue is a page download queue and the second download queue is an application download queue.

In some embodiments, for each website to be downloaded, one or more crawl rules are predefined; the rules are used uniformly by a plurality of download modules and are specified separately for the different page types of the website.

In some embodiments, step d) comprises: when the crawl rule includes a first attribute indicating the type of the subsequent address, determining which download queue the subsequent address belongs to according to the first attribute.

In some embodiments, step d) further comprises: when the crawl rule does not include a first attribute indicating the type of the subsequent address, determining that the subsequent address belongs to the second download queue.

In some embodiments, step e) comprises: further determining whether the subsequent address matches the regular expression defined by a second attribute in the crawl rule; if it matches, splicing the character strings that result from matching the subsequent address against the regular expression into the address to be stored in the corresponding download queue; otherwise, skipping the subsequent address.
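The regular-expression check in step e) might look like the sketch below. The text does not spell out how the matched strings are spliced, so this example assumes that the captured groups are simply concatenated; the function name `apply_rule_regex` and the sample pattern and link are hypothetical.

```python
import re

def apply_rule_regex(candidate, pattern):
    """Return the address built from the captured groups, or None to skip.

    Assumption: "splicing" means concatenating the capture groups in order.
    """
    match = re.search(pattern, candidate)
    if match is None:
        return None  # the candidate does not satisfy the rule, so skip it
    return "".join(group for group in match.groups() if group is not None)

# Hypothetical JS-style link whose real address is split across two arguments.
js_link = "javascript:open('http://example.com/app/', '42.apk')"
rule_pattern = r"open\('([^']+)',\s*'([^']+)'\)"

address = apply_rule_regex(js_link, rule_pattern)
skipped = apply_rule_regex("mailto:someone@example.com", rule_pattern)
```

Here `address` is the concatenation of the two captured strings, and `skipped` is `None` because the second candidate does not match the rule.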

In some embodiments, step c) comprises: parsing the subsequent addresses to be downloaded from the downloaded data according to a third attribute in the crawl rule, the third attribute describing the characteristics of the character string that makes up a subsequent address.

In some embodiments, the third attribute is an XPath expression.

In some embodiments, step b) comprises: reading a record from the first download queue; deduplicating the record using a first deduplication list maintained in the database; and downloading the corresponding data according to the record.

In some embodiments, the step of deduplicating the record comprises: when the record is determined to match a record in the first deduplication list, skipping the record and reading the next record from the first download queue.

In some embodiments, before step e), the method further comprises: updating the first deduplication list with the address in the record; and step e) comprises: if the subsequent address belongs to the first download queue, deduplicating the subsequent address using the updated first deduplication list and storing it in the first download queue; and if the subsequent address belongs to the second download queue, deduplicating the subsequent address using a second deduplication list maintained in the database, updating the second deduplication list accordingly, and storing the address in the second download queue.

In some embodiments, step b) comprises: reading an address to be downloaded from the second download queue in priority order; deduplicating the address using a third deduplication list maintained in the database; downloading the data pointed to by the address; and updating the third deduplication list.

In some embodiments, step b) further comprises: checking the downloaded data; computing a digital digest of the downloaded data; deduplicating the downloaded data, based on the digital digest, using a fourth deduplication list maintained in the database; updating the fourth deduplication list; and storing the downloaded data in a file system.
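The digest-based deduplication of downloaded data can be sketched as follows. The source only speaks of a "digital digest"; SHA-256 is an assumption made for illustration, and a plain `set` and `dict` stand in for the fourth deduplication list and the file system. The name `store_if_new` is hypothetical.

```python
import hashlib

def store_if_new(data: bytes, seen_digests: set, storage: dict) -> bool:
    """Store data only if its digest has not been seen before."""
    digest = hashlib.sha256(data).hexdigest()  # digest algorithm is an assumption
    if digest in seen_digests:
        return False                 # identical content was already stored
    seen_digests.add(digest)         # update the (stand-in) deduplication list
    storage[digest] = data           # stand-in for writing to the file system
    return True

seen, files = set(), {}
first = store_if_new(b"same bytes", seen, files)
second = store_if_new(b"same bytes", seen, files)   # same content, other URL
```

Deduplicating by content rather than by address catches the case where the same file is reachable through several different links.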

In some embodiments, every record in the first, second and third deduplication lists includes a validity period. When a deduplication list is used to deduplicate an address and a record matching the address exists in the list, the method comprises: determining whether the validity period of the record has expired; if it has expired, the address is not removed as a duplicate; if it has not yet expired, the address is removed as a duplicate.

In some embodiments, the crawl rule further includes a fourth attribute. When the value of the fourth attribute is "True", the number of times the crawl rule containing the fourth attribute has been used is recorded in order to identify which page of the corresponding website's application list the link currently being downloaded is on.

In some embodiments, when the first, second or third deduplication list is updated, the validity period of the corresponding record in the deduplication list is set larger when the link currently being downloaded is on a page within a predetermined number of pages of the application list than when it is on a page after the predetermined number of pages of the application list.

In some embodiments, the predetermined number of pages is less than or equal to 5.

In some embodiments, statistics related to the downloading of website data are provided through the database.

In some embodiments, the statistics include one or more of the following: the current lengths of the plurality of download queues, the length of each deduplication list within a specified time period, or the total length of each deduplication list.

In some embodiments, the database is a key-value database.

In some embodiments, the database is a Redis database.

In some embodiments, the predetermined condition is that all download queues are empty, that a predetermined crawl depth has been reached, or a combination of the two.

According to a second aspect of the present invention, a distributed system for downloading website data is provided. The system comprises: a controller for reading one or more addresses in an initial address list and storing the one or more addresses into at least one of a plurality of download queues maintained in a database; the database, for maintaining the plurality of download queues, wherein the plurality of download queues comprise at least a first download queue and a second download queue, and each download queue corresponds to a different type of data; and a plurality of download modules for downloading data of a plurality of types according to the plurality of download queues, respectively, parsing subsequent addresses to be downloaded from the data downloaded according to the first download queue, determining which download queue each subsequent address belongs to, storing the subsequent address into the corresponding download queue according to the result of the determination, and repeating the above process until a predetermined condition is met.

In some embodiments, the plurality of download modules are further configured to: download data from the corresponding website according to the records in the corresponding download queue, subject to a per-website download interval that is centrally maintained in the database.

In some embodiments, the download module is configured to: when the crawl rule includes a first attribute indicating the type of the subsequent address, determine which download queue the subsequent address belongs to according to the first attribute.

In some embodiments, the download module is further configured to: when the crawl rule does not include a first attribute indicating the type of the subsequent address, determine that the subsequent address belongs to the second download queue.

In some embodiments, the download module is configured to: further determine whether the subsequent address matches the regular expression defined by a second attribute in the crawl rule; if it matches, splice the character strings that result from matching the subsequent address against the regular expression into the address to be stored in the corresponding download queue; otherwise, skip the subsequent address.

In some embodiments, the download module is configured to: parse the subsequent addresses to be downloaded from the downloaded data according to a third attribute in the crawl rule, the third attribute describing the characteristics of the character string that makes up a subsequent address.

In some embodiments, the download module is configured to: read a record from the first download queue; deduplicate the record using a first deduplication list maintained in the database; and download the corresponding data according to the record.

In some embodiments, the download module is further configured to: when the record is determined to match a record in the first deduplication list, skip the record and read the next record from the first download queue.

In some embodiments, the download module is configured to: update the first deduplication list with the address in the record; if the subsequent address belongs to the first download queue, deduplicate the subsequent address using the updated first deduplication list and store it in the first download queue; and if the subsequent address belongs to the second download queue, deduplicate the subsequent address using a second deduplication list maintained in the database, update the second deduplication list accordingly, and store the address in the second download queue.

In some embodiments, the download module is further configured to: read an address to be downloaded from the second download queue in priority order; deduplicate the address using a third deduplication list maintained in the database; download the data pointed to by the address; and update the third deduplication list.

In some embodiments, the download module is further configured to: check the downloaded data; compute a digital digest of the downloaded data; deduplicate the downloaded data, based on the digital digest, using a fourth deduplication list maintained in the database; update the fourth deduplication list; and store the downloaded data in a file system.

In some embodiments, every record in the first, second and third deduplication lists includes a validity period. When a deduplication list is used to deduplicate an address and a record matching the address exists in the list, the download module is further configured to: determine whether the validity period of the record has expired; if it has expired, not remove the address as a duplicate; if it has not yet expired, remove the address as a duplicate.

In some embodiments, when the first, second or third deduplication list is updated, the download module sets the validity period of the corresponding record in the deduplication list larger when the link currently being downloaded is on a page within a predetermined number of pages of the application list than when it is on a page after the predetermined number of pages of the application list.

In some embodiments, the database is a key-value database.

In some embodiments, the database is a Redis database.

With the method and system provided by the present invention, crawlers (downloaders) can apply different download and processing flows depending on the type of data to be obtained, for example (but not limited to): web page files (e.g., html, xhtml, shtml), applications (e.g., exe files that run on Windows, apk application packages for the Android system), document files (e.g., Word files, PDF files), moving image files (e.g., AVI files, MP4 files), still image files (e.g., JPG files, BMP files), audio files (e.g., MP3, WMA), and various other data types (e.g., Flash). This makes full use of the different characteristics of these data types with respect to processor/memory resources and/or network resources.

In addition, with the method and system provided by the present invention, a validity period can be provided for any record in a deduplication list, which makes it possible to repeatedly crawl web pages that may be updated.

In addition, by using XPath expressions and regular expressions, the method and system provided by the present invention can to a large extent handle addresses that are expressed in JS form in web page files.

In addition, the method and system provided by the present invention support crawl rules that are used uniformly by a plurality of crawler modules and are specified separately for different page types. This simplifies maintenance of the crawl rules of the crawler modules while allowing the crawl rules to support more complex settings, so that crawl rules can be designed more easily for each website. In particular, by allowing crawl rules to support page-turning rules, the user can adopt different crawl strategies and different strategies for updating the deduplication lists for different page numbers.

In addition, with the method and system provided by the present invention, the crawler modules are operated through a dedicated controller and the database (start, stop, check status, modify configuration, etc.), without having to operate separately on each node where a crawler module is deployed.

In addition, with the method and system provided by the present invention, through the lists for each crawl/download group that are uniformly maintained in the database, the interval at which all crawlers (downloaders) crawl a given website can be limited. Furthermore, because the crawl queues are uniformly maintained in the database, multiple distributed crawlers can share the same crawl queue.

In addition, with the method and system provided by the present invention, through the statistics functions for the lists/queues maintained in the database, relevant statistics can be obtained, including (but not limited to): the current size of the crawl queue/download queue, and, overall or within a specific time period, the number of pages crawled, the number of download links parsed, the number of downloads, the number of new files, etc.

In other words, the method and system provided by the present invention can solve, or at least alleviate, the aforementioned technical problems of the prior art.

Brief description of the drawings

The above and other objects, features and advantages of the present invention will become clearer from the following description of preferred embodiments of the present invention with reference to the accompanying drawings, in which:

Fig. 1 shows a simplified block diagram of an example distributed system for downloading website data according to an embodiment of the present invention;

Fig. 2 shows a schematic diagram of some example data structures maintained in the database shown in Fig. 1;

Figs. 3～9 show screenshots of an example process for designing an example crawl rule according to an embodiment of the present invention;

Fig. 10 shows a flow chart of the crawling part of an example distributed method for downloading website data according to an embodiment of the present invention; and

Fig. 11 shows a flow chart of the download part of an example distributed method for downloading website data according to an embodiment of the present invention.

Throughout the drawings, the same or similar structures are identified by the same or similar reference numerals.

Embodiment

Preferred embodiments of the present invention are described in detail below with reference to the accompanying drawings. Details and functions that are unnecessary for the present invention are omitted from the description so as not to obscure the understanding of the invention. The method and system of the present invention are described in detail below in connection with specific embodiments. However, those skilled in the art should understand that the specific embodiments described below do not limit the scope of the invention; rather, they should be interpreted as falling within the spirit and scope of the invention as defined by the appended claims, and those skilled in the art can make various modifications, additions, deletions or substitutions to the invention without departing from the spirit and scope of the invention as defined by the appended claims.

First, referring to Fig. 1, a simplified block diagram of an example distributed system 10 for downloading website data according to an embodiment of the present invention is shown. Distributed system 10 comprises: controller 110, database 120, a plurality of crawlers 130_1～130_N (N denotes the number of crawlers) and a plurality of downloaders 140_1～140_M (M denotes the number of downloaders). Controller 110 is connected to database 120, and database 120 is connected to each crawler 130_1～130_N and each downloader 140_1～140_M, forming a star topology centered on database 120. Controller 110 can thus control each crawler 130_1～130_N and each downloader 140_1～140_M through database 120, and each crawler 130_1～130_N and each downloader 140_1～140_M can feed back various information to controller 110 through database 120. Of course, in other embodiments, controller 110 can be directly connected to and communicate with each crawler 130_1～130_N and each downloader 140_1～140_M.

In the present embodiment, database 120 is a key-value database, and more preferably a Redis database. Of course, database 120 can also be another key-value database, for example (but not limited to): Berkeley DB, MemCacheDB, BigTable, Dynamo, Tokyo Cabinet, etc. In addition, database 120 can also be a database that is not of the key-value type, such as (but not limited to): MySQL, Microsoft SQL, Oracle RDBMS, etc.

However, compared with a traditional relational database, a key-value database can provide higher query speed, larger data capacity and higher concurrency, and is better suited to queries by primary key. A key-value database is therefore preferably used to store and manage the queues, lists, configurations, data files, etc. for the modules of the present invention. In addition, because the Redis database provides a variety of data structures (including string, doubly linked list (list), set, and sorted set (zset)), it is better suited than other key-value databases to implementing the method and system of the present invention. How these data structures, in particular the sorted set (zset), are used to implement the various data structures stored in database 120 according to an embodiment of the present invention is described below.

To aid understanding of what follows, the structure of the zset is briefly introduced here. A zset is a set-like data structure in which each element additionally carries an ordering attribute (the z value, also called the score). This attribute can be specified when an element is added or modified. After each such update, the zset automatically reorders its elements according to the new z value. Elements in the zset can be accessed very efficiently in the order determined by their z values. For more details, see the zset documentation: http://redis.io/commands#sorted_set.
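The behavior described for the zset can be mimicked with a toy in-memory class. This is not the Redis client API and has none of Redis's performance characteristics; `TinyZSet` is a hypothetical teaching aid whose method names merely echo the Redis commands of the same name, and it only supports a non-negative `start` index.

```python
class TinyZSet:
    """Toy in-memory stand-in for a Redis sorted set (illustration only)."""

    def __init__(self):
        self._scores = {}  # member -> score (the "z value")

    def zadd(self, member, score):
        # Adding an existing member just replaces its score.
        self._scores[member] = score

    def zscore(self, member):
        return self._scores.get(member)

    def _ordered(self):
        # Order by score; ties are broken by member name.
        return sorted(self._scores, key=lambda m: (self._scores[m], m))

    def zrange(self, start, stop):
        ordered = self._ordered()
        # As in Redis, stop is inclusive and a negative stop counts from the end.
        stop = len(ordered) + stop if stop < 0 else stop
        return ordered[start:stop + 1]

    def zrangebyscore(self, low, high):
        return [m for m in self._ordered() if low <= self._scores[m] <= high]


z = TinyZSet()
z.zadd("a", 3.0)
z.zadd("b", 1.0)
before = z.zrange(0, -1)   # ordered by score: b, then a
z.zadd("a", 0.5)           # re-scoring moves "a" to the front
after = z.zrange(0, -1)
```

Changing a member's score is enough to change its position, which is the property the queues and lists described later rely on.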

Returning now to Fig. 1, the description of distributed system 10 continues. In the present embodiment, distributed system 10 is mainly used to download the web pages of specified websites and the applications provided in those pages, and to analyze the downloaded pages in order to extract the addresses of web pages and/or applications for subsequent download. In another example, distributed system 10 can also be used to download other types of data, for example to download and analyze web pages and the videos provided in them. In yet another example, distributed system 10 can also be used to download and analyze one type, or three or more types, of data (for example, web pages, applications, audio, etc.).

In distributed system 10, controller 110 operates on the data structures maintained in database 120 through the database interface that database 120 provides, and thereby indirectly controls each crawler 130_1～130_N and each downloader 140_1～140_M.

In the present embodiment, crawlers 130_1～130_N and downloaders 140_1～140_M are developed on the basis of the open-source crawler framework scrapy1.6, but embodiments of the present invention are not limited to this, and other crawler programs can also be used.

Accordingly, the main functions and usage of the modules in the distributed system 10 shown in Fig. 1 are as shown in Table 1.

Table 1

As Table 1 shows, database 120 is the core of flow control in distributed system 10; the modules of the system are independent of one another and exchange information by operating on database 120.

Next, concrete examples of the lists and queues maintained in database 120 for coordinating and controlling each crawler 130_1～130_N and each downloader 140_1～140_M are described with reference to Fig. 2. Referring to Fig. 2, the following data structures are maintained in database 120: crawl queues (first download queues) 121_1～121_L (L denotes the number of websites to be downloaded; the same applies below) and download queues (second download queues) 122_1～122_L for each website to be downloaded; crawler deduplication lists (first deduplication lists) 123_1～123_L, parse deduplication lists (second deduplication lists) 124_1～124_L, download deduplication lists (third deduplication lists) 125_1～125_L and file deduplication lists (fourth deduplication lists) 126_1～126_L for each website to be downloaded; initial address lists 127_1～127_L for each website to be downloaded; access restriction lists 128_1～128_P for each crawl/download group (P denotes the number of crawl/download groups); type configuration lists 129_1～129_Q for each web page type (Q denotes the number of web page types); website configuration lists 130_1～130_L for each website; and a configuration update list 131 for the type configuration lists and website configuration lists.

Next, the form and purpose of each of the above data structures are described in detail.

(1) Crawl queues 121_1～121_L and download queues 122_1～122_L for each website to be downloaded

For each website to be crawled, one crawl queue and one download queue are maintained in database 120 to hold the addresses waiting to be crawled or downloaded. Both queues are implemented as zsets, and their key names can be formed as "crawl_queue_" + "website name" and "down_queue_" + "website name", respectively. The z value of each member of a queue (a pending address) is the negative of the crawl priority (hereinafter, priority) of the corresponding address. The member with the highest priority in a queue can therefore be obtained by fetching the element with the smallest z value in the set (using the zrange command).
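The negated-priority trick can be illustrated without a database. In the sketch below a plain dictionary (address to score) stands in for the zset, and `queue_key`, `push` and `pop_highest` are hypothetical helper names; only the key prefixes "crawl_queue_" and "down_queue_" come from the text.

```python
def queue_key(prefix, site_name):
    """Build a key such as 'crawl_queue_example' from prefix and site name."""
    return prefix + site_name

def push(queue, address, priority):
    # Store the negated priority as the score, so that the smallest score
    # corresponds to the highest priority.
    queue[address] = -priority

def pop_highest(queue):
    """Remove and return the address with the smallest score, or None."""
    if not queue:
        return None
    address = min(queue, key=lambda a: (queue[a], a))
    del queue[address]
    return address

queues = {queue_key("crawl_queue_", "example"): {}}
q = queues["crawl_queue_example"]
push(q, "http://example.com/low", priority=1)
push(q, "http://example.com/high", priority=5)
order = [pop_highest(q), pop_highest(q), pop_highest(q)]
```

Because ascending score order is the natural read order of a sorted set, negating the priority lets the highest-priority address be read from the front of the set.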

(2) The crawl deduplication lists 123_1～123_L, parse deduplication lists 124_1～124_L, download deduplication lists 125_1～125_L and file deduplication lists 126_1～126_L for each website to be downloaded

For each website to be crawled, database 120 maintains the deduplication lists named above. They are all implemented as zsets, with key names of the form preset string + website name (each kind of deduplication list uses a different preset string). In a deduplication list, the z value of each member is the timestamp at which that element was last processed.

For an address without a validity period (that is, one that never needs to be crawled again), checking whether the element exists in the list (zrank command) is enough to decide whether it is a duplicate. For an address with a validity period, the z value of the element is fetched with the zscore method, and the element is a duplicate if that timestamp + validity period is greater than the current time, as described in detail below.

The zset data structure supports querying all elements whose z value falls within a given interval (zrangebyscore command). Controller 110 uses this feature when gathering statistics about the crawler system, as described in detail below.
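The two duplicate checks can be sketched as follows. This is an illustration under stated assumptions: a dict from address to last-processed timestamp stands in for the zset, the function name `is_duplicate` is invented for the example, and `None` is used here to mean "no validity period".

```python
from typing import Dict, Optional

def is_duplicate(dedup: Dict[str, float], url: str,
                 expire: Optional[float], now: float) -> bool:
    """Return True if the address should be skipped as a duplicate."""
    last_processed = dedup.get(url)      # plays the role of the zscore lookup
    if last_processed is None:
        return False                     # never processed: not a duplicate
    if expire is None:
        return True                      # no validity period: presence alone decides
    return last_processed + expire > now # still inside its validity period

dedup = {"http://example.com/a": 1000.0}  # hypothetical entry
```

With a validity period of 3000 seconds, the entry above counts as a duplicate until the clock reaches 4000, after which the address may be processed again.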

(3) The initial address lists 127_1～127_L for each website to be downloaded

For each website to be crawled, database 120 maintains a corresponding initial address list. The list is implemented as a zset with the key name "start_url_" + website name. In the initial address list, the z value of each member is the earliest time at which the address may next be pushed into the crawl queue. At fixed or variable intervals, a crawler (or the controller or any other module) takes from this list the members whose z value is less than the current time (zrangebyscore command), pushes them into the crawl queue, and updates the z value of the corresponding items in the initial address list to current time + validity period.
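A minimal sketch of this periodic release step, assuming dicts in place of the two zsets; the function name `release_due` and the default priority of 0 are assumptions, not part of the described system.

```python
from typing import Dict, List

def release_due(start_urls: Dict[str, float], crawl_queue: Dict[str, float],
                now: float, expire: float, priority: int = 0) -> List[str]:
    """Push every start address whose next-allowed time has passed."""
    due = sorted(url for url, next_time in start_urls.items() if next_time < now)
    for url in due:
        crawl_queue[url] = -priority     # crawl queue scores are negated priorities
        start_urls[url] = now + expire   # earliest time of the next release
    return due

start_urls = {"http://apk.angeeks.com/": 0.0}
crawl_queue: Dict[str, float] = {}
released = release_due(start_urls, crawl_queue, now=1000.0, expire=5000.0)
```

After the call, the start address is not released again until the clock passes 6000, which is how the list throttles re-crawls of the entry page.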

(4) The access restriction lists 128_1～128_P for each crawl/download group

One or more websites to be downloaded can be grouped into crawl/download groups. For each crawl/download group, database 120 maintains a corresponding list. The list is implemented as a zset, with the key names "crawl_group_" + "website name" and "down_group_" + "website name", respectively. In the list, the z value of each member is the time at which the corresponding website may next be crawled/downloaded. A crawler/downloader queries the access restriction list of its own crawl/download group for members whose z value is less than the current time (for example, with the zrangebyscore command) to obtain the names of the websites that need to be crawled/downloaded. With these lists, each crawl/download group can be given its own access frequency for a particular website, which allows more flexible and effective control of the crawlers/downloaders.

(5) The type configuration lists 129_1～129_Q for each page type and the website configuration lists 130_1～130_L for each website

For each type configuration/website configuration, database 120 maintains a value of String type, with the key names "type_config_" + "type name" and "spider_config_" + "website name", respectively. The content is the string form of the corresponding xml node in the configuration file, as described in detail below.

(6) The configuration update list 131 for the type configuration lists and website configuration lists

This list stores configuration update information: when an existing type configuration/website configuration changes, the corresponding information is stored in this list to notify every module to update. The list is implemented as a zset with the key name "config_update_list". Each member is the key name in database 120 of a configuration that has been updated, and its z value is the update time. Every module of the system records its start-up time as its last update time when it starts, and afterwards checks this list once every fixed period (1000 seconds). If a member has an update time (z value) greater than the module's last update time, the configuration corresponding to that member is read in again. After checking the list, the module sets its last update time to the current time to avoid repeated updates.

In addition, the key names of the above lists/queues are not limited to the strings given in the examples; any other strings that distinguish the queues/lists may be used.
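The polling logic can be sketched as below. It is an illustration only: a dict from configuration key name to update time stands in for the zset, and the function name `check_updates` and the key "type_config_list" are hypothetical.

```python
from typing import Dict, List, Tuple

def check_updates(update_list: Dict[str, float], last_update: float,
                  now: float) -> Tuple[List[str], float]:
    """Return the keys changed since last_update and the new last-update time."""
    changed = sorted(key for key, updated_at in update_list.items()
                     if updated_at > last_update)
    # Setting the last-update time to "now" prevents re-reading the same change.
    return changed, now

update_list = {"spider_config_angeeks": 1500.0, "type_config_list": 900.0}
changed, last_update = check_updates(update_list, last_update=1000.0, now=2000.0)
```

A module would call this once per polling period and reload only the configurations named in `changed`.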

Next, an example process of designing crawl rules is described in detail with reference to Figs. 3～9. In the present invention, crawl rules instruct the crawler (module) how to crawl a website and parse subsequent crawl/download addresses. In this example, in view of the characteristics of software download websites, the system classifies pages by type and applies different crawl rules to pages of different types. Crawl rules can of course also be designed for the characteristics of other kinds of websites; the invention is not limited in this respect. Crawl rules are specified mainly by xpath expressions, and regular expressions can be used to filter out pages that should not be crawled.

Specifically, this embodiment uses the Firebug and Firepath plug-ins of the Firefox browser to demonstrate how to design crawl rules. The process can of course also be demonstrated with xpath-related plug-ins of other browsers, or with other development interfaces or debugging tools. In addition, this embodiment designs crawl rules for a single website ("Anji Market"), but it should be understood that crawl rules can be designed similarly for one or more websites, as long as the same purpose is achieved.

Taking the website "Anji Market" as an example, crawl rules are typically obtained as follows. Referring to Fig. 3, the crawler starts crawling from the start page http://apk.angeeks.com, and the page type of this page is defined as "home". The expression based on xpath syntax

//dd[@id='vtit']//a

selects each main list page as the next page to crawl, with page type "list_".

The expression "//dd[@id='vtit']//a" selects all anchor elements ("a") under the definition list items ("dd") of the current page whose "id" attribute is the string "vtit". For the concrete syntax of html and xpath, refer to their reference documentation, for example http://www.w3school.com.cn/h.asp and http://www.w3school.com.cn/xpath, respectively.

As shown in Fig. 3, the above xpath expression selects the header list items on the "Anji Market" home page, such as "Home", "Games" and "Applications"; the selected items are marked with dashed boxes in the page. The lower half of Fig. 3 shows that the extracted list items are full or partial page links (addresses), for example "../index.jsp" and "http://apk.angeeks.com/list/c-8.html". Note that the three main list page links "Home", "Essential Installs" and "Model Zone" are not the download list pages we need. To reduce unnecessary crawling, the extracted addresses can be filtered further with the regular expression (.*list.*), which keeps only the links containing the string "list" and thereby filters out the "Home", "Essential Installs" and "Model Zone" links. Since regular expression syntax is well known to those skilled in the art, it is not described further here.

In this way, the above xpath expression and regular expression select five pages: "Games", "Applications", "E-books", "Themes" and "Wallpapers". The "Games" page is used as the example in what follows. Fig. 4 shows the subsequent operation performed on the Games page. As described with reference to Fig. 3, the page type of the five main list pages is defined as "list_". Although a page of type "list_" is also a (main) list page, packages should be crawled in chronological order (for example, so that recently updated packages are crawled first), so this kind of page is not suitable as the list page from which download addresses are obtained. For other websites, the default page may already be a list sorted by time, in which case the step shown in Fig. 4 is unnecessary. This depends entirely on the design of the website to be crawled.
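The filtering step can be illustrated with Python's `re` module. The two sample links are the ones quoted from Fig. 3; the variable names are only for this sketch.

```python
import re

# The filter from the text: keep only links that contain the string "list".
LIST_FILTER = re.compile(r"(.*list.*)")

extracted_links = [
    "../index.jsp",
    "http://apk.angeeks.com/list/c-8.html",
]

# A link is kept only when the regular expression matches it.
kept_links = [link for link in extracted_links if LIST_FILTER.match(link)]
```

Here "../index.jsp" is dropped because it does not contain "list", while the list page link is kept.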

Returning to Fig. 4, to select "Sort by update time" on the "Games" page, the xpath expression

//div[@class='mytak']/p[last()]/a

is used to select the "Sort by update time" link, and the resulting link is defined as type "list". As shown in Fig. 4, the selected item is marked with a dashed box in the page.

After "Sort by update time" has been selected, two crawl rules are needed for the subsequent operations. The first crawl rule parses the information pages (that is, the pages that provide the related information and download link of each application to be downloaded) of all applications on the current list page. The second crawl rule turns the current list page to the next page, that is, it obtains the next list page.

For the first crawl rule, Fig. 5 shows the step of obtaining the information pages of all applications to be downloaded on the current list page. Here the xpath expression

//div[@class='info']/a

is used to obtain the page of each application to be downloaded, whose type is defined as "topic". As shown in Fig. 5, this xpath expression selects each software information page on the "Games" page in "Sort by update time" mode, for example "Mechanical Zuma", "Professional Billiards (Chinese edition)" and "Chinese Chess Wizard"; the selected items are marked with dashed boxes in the page. The lower half of Fig. 5 shows that the extracted information pages are partial page links (addresses), for example "/soft/10119542.html". For other websites, the address of an information page is not limited to a partial address and can also be a full address; this depends entirely on the design of the website.

For the second crawl rule, Fig. 6 shows the step of obtaining the address behind the ">" button on the current list page, which points to the next page. Here the xpath expression

//div[@class='pagenumt']/a[last()-1]

is used to obtain the next list page, whose type is also defined as "list". As shown in Fig. 6, this xpath expression selects the address behind the ">" button on the page; the selected item is marked with a dashed box in the page. The second crawl rule is therefore also called the "page-turning rule". The lower half of Fig. 6 shows that the extracted list page is a partial page link (address), for example "/list/c-8-2-1.html".

In this way, by repeatedly applying the first and second crawl rules, the links to the information pages (download pages) of all applications on the "Games" page of "Anji Market" in "Sort by update time" mode can be obtained. Note that when the second rule reaches the last page, the address returned by the above xpath expression still points to the last page (that is, the current page at that moment), so the parsing shown in Fig. 6 must be ended by the deduplication list or by some other termination means (for example, another crawl rule or a regular expression); the invention is not limited to these termination means. Corresponding crawl rules should of course be designed and used for different websites.

Next, how to obtain the address for downloading the corresponding software from each information page is described in detail with reference to Fig. 7. Taking the software "Professional Billiards (Chinese edition)" as an example, Fig. 7 shows the step of obtaining its download link with an xpath expression. In Fig. 7, the xpath expression

//img[@src="/images/wandoujiabutton.jpg"]/..

is used to obtain the download link of the software. This xpath expression selects the address behind the "One-click install" button, marked with a dashed box in the page. The lower half of Fig. 7 shows that the extracted download link is a full page link (address), for example "http://apk.angeeks.com/downloadAPK.do?soft_id=10119517".

In this way, the steps described with reference to Figs. 3～7 yield all crawl rules for "Anji Market". Following these rules, the crawler obtains the application download list of each software item sorted by update time and stores it in the download queue, and the downloader downloads according to the download queue.

It is also worth noting that most application download sites provide a large amount of related information on the software details page (the "topic" page in this embodiment). The same approach used to extract the download link (xpath expressions) makes it easy to obtain the related information of the downloaded package at the same time.

Next, in order to configure the crawler module and the download module through a configuration file, the crawl rules generated above for a specific website need to be standardized. For example, when the configuration file is defined in xml format, the above crawl rules can be defined as a spider node in the configuration file, for example as follows:

The meaning of each item in this spider node is explained in detail below.

(1) spider node

The "name" attribute of the spider node is the short name of the website inside the system of the present invention; for the example shown in Figs. 3～7, the "name" attribute can be set to "angeeks" (Anji). The "channel" attribute identifies the website so that it can be connected to downstream systems.

(2) The start_url node of the spider node

The start_url node defines a start page. There can be multiple start pages, and therefore multiple start_url nodes. In this embodiment the start page type defaults to "home", but a type can be specified by adding a "type" attribute to the node; when there are multiple start pages, each can be given its own type. The "url" attribute of the start_url node gives the concrete URL the node points to. In this example there is only one start page, whose address is the "http://apk.angeeks.com/" in the "url" attribute.

(3) The crawl_rule node of the spider node

The crawl_rule node defines a single crawl rule. Its "from" attribute specifies the page type to which the rule is applied, and its "to" attribute specifies the type of the pages pointed to by the addresses parsed out by the rule. When the "to" attribute is omitted, the parsed addresses are submitted to the downloader as download links. For example, the last crawl_rule node of the above spider node has no "to" attribute, so the addresses parsed by that rule are treated as download links and added to the download queue after deduplication; this rule corresponds to the step shown in Fig. 7. The first four crawl rules (crawl_rule) in the above example, which have a "to" attribute, correspond to the crawl rules shown in Figs. 3～6, respectively.

In other embodiments, addresses (pages) of another specific type can also be treated as download links. For example, a "download" type can be defined as the type of download links, and the "to" attribute set to "download" to indicate that the addresses parsed by the rule are download addresses to be added to the download queue.

The xpath expression in a crawl rule (crawl_rule) node can be specified in two ways: the "xpath" attribute and the "restrict_xpath" attribute. The "restrict_xpath" attribute extracts "a" tags from the html file and uses the value of the "href" attribute of each "a" tag as the new address. The "xpath" attribute extracts the address string directly. As this description shows, the "xpath" attribute is the more capable of the two; the "restrict_xpath" attribute is nevertheless provided for backward compatibility with earlier websites. When both attributes are present, the system ignores the restrict_xpath attribute.

The "filter_regex" attribute of the crawl_rule node specifies a regular expression, which is matched against the address strings extracted by the "restrict_xpath" or "xpath" attribute. If the "filter_regex" attribute exists, the program matches each extracted address against the regular expression. If the match fails, the address is skipped. If the match succeeds, all matched groups are concatenated to form the final address. For example, a parameter such as "timestamp" in an extracted address string can cause various problems during crawling, so for an address such as

www.domain.com/item?timestamp=6666&itemid=8888

a regular expression similar to

(.*/item\?)timestamp=\d*&(itemid=\d*)

can be used for matching. With a match of this form, the part in the middle that contains the "timestamp" parameter is removed, leaving only the leading part of the address (one string) and the required "itemid" parameter (another string). Concatenating these strings gives the final address www.domain.com/item?itemid=8888. The number of strings is of course not limited to 2; in other examples it can also be 1, or 3 or more.
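The match-then-concatenate behaviour can be sketched with Python's `re` module. This is an illustration under assumptions: the function name `apply_filter` is invented, `re.match` is assumed as the matching mode, the literal "?" is escaped as `\?` so that the pattern matches the sample address, and returning the whole match when a pattern has no groups is a choice made only for this sketch.

```python
import re
from typing import Optional

def apply_filter(pattern: str, address: str) -> Optional[str]:
    """Return the joined groups on success, or None when the address is skipped."""
    match = re.match(pattern, address)
    if match is None:
        return None                       # match failed: skip this address
    groups = match.groups()
    if not groups:
        return match.group(0)             # assumption: no groups means keep the match
    return "".join(g for g in groups if g is not None)

PATTERN = r"(.*/item\?)timestamp=\d*&(itemid=\d*)"
cleaned = apply_filter(PATTERN, "www.domain.com/item?timestamp=6666&itemid=8888")
```

The first group captures everything up to and including "item?", the second captures the "itemid" parameter, and the uncaptured "timestamp" part in between is dropped when the groups are joined.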

When the "next_page" attribute of a crawl_rule node exists and is "true", the rule is the "page-turning rule" described above. By recording how many times this rule has been used during crawling, the crawler can tell on which page of the site's application list the currently crawled link is located. In this embodiment, crawl rule design requires that this attribute be defined only on the rule that pages through the "Sort by update time" list. In other embodiments, the attribute can also be defined for lists in other states, not necessarily the list in "Sort by update time" mode.

The names of the above nodes and/or attributes can of course be changed as needed when implementing the system of the present invention and are not limited to the forms given in the examples above.
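To make the node and attribute semantics concrete, the sketch below parses a small spider node with Python's standard `xml.etree.ElementTree`. The xml text is a hypothetical illustration assembled from the attributes and xpath expressions discussed in this description; it is not the configuration from the original filing, and the "channel" value, the choice of "restrict_xpath" for each rule and the function name `load_spider` are assumptions.

```python
import xml.etree.ElementTree as ET

# Hypothetical spider node, for illustration only.
SPIDER_XML = """
<spider name="angeeks" channel="example-channel">
  <start_url url="http://apk.angeeks.com/"/>
  <crawl_rule from="home" to="list_" restrict_xpath="//dd[@id='vtit']//a" filter_regex="(.*list.*)"/>
  <crawl_rule from="list" to="list" restrict_xpath="//div[@class='pagenumt']/a[last()-1]" next_page="true"/>
  <crawl_rule from="topic" restrict_xpath="//img[@src='/images/wandoujiabutton.jpg']/.."/>
</spider>
"""

def load_spider(xml_text: str):
    """Return (site name, start URLs, list of rule dicts)."""
    root = ET.fromstring(xml_text.strip())
    rules = []
    for node in root.findall("crawl_rule"):
        rules.append({
            "from": node.get("from"),
            "to": node.get("to"),
            # "xpath" wins when both attributes exist, as the text describes.
            "expr": node.get("xpath") or node.get("restrict_xpath"),
            "filter_regex": node.get("filter_regex"),
            "next_page": node.get("next_page") == "true",
            # A rule without "to" yields download links.
            "is_download": node.get("to") is None,
        })
    start_urls = [n.get("url") for n in root.findall("start_url")]
    return root.get("name"), start_urls, rules

site_name, start_urls, rules = load_spider(SPIDER_XML)
```

Keeping the rules as plain data makes it possible to store the node as a string in the database and reload it whenever the configuration changes.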

In addition, some websites generate page links with embedded JS. The system's way of defining crawl rules, which combines xpath expressions with regular expressions, provides great flexibility for handling such websites. Figs. 8 and 9 show a relatively simple example of this situation: extracting download links from "Anzhi Market".

As shown in Fig. 8, the download link of an application in "Anzhi Market" is not an actual address but a piece of JS code in which the download address is embedded. In this case, the corresponding node in the html file can be selected with an xpath expression, and the download link inside it can then be matched with a regular expression.

For example, a corresponding crawl rule is as follows:

<crawl_rule from="topic" restrict_xpath="//div[@class='detail_down']/a" filter_regex="'(dl_app[^']*)'"/>

Here, the "from" attribute indicates that the rule applies to pages of type "topic" (that is, information pages); the absence of a "to" attribute means that the addresses parsed by the rule are download links to be added to the download queue; the "restrict_xpath" attribute extracts from the current page the html node that contains the JS code described above; and the "filter_regex" attribute is the regular expression used to extract the download address from that node. In this example, the address is a partial address.
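A short sketch of the regular-expression step. The `href` string below is a hypothetical stand-in for the JS code shown in Fig. 8, the function name `extract_embedded` is invented, and searching anywhere in the string with `re.search` is an assumption about how the pattern is applied.

```python
import re
from typing import Optional

def extract_embedded(pattern: str, node_text: str) -> Optional[str]:
    """Return the first captured group found anywhere in node_text, if any."""
    match = re.search(pattern, node_text)
    return match.group(1) if match else None

# Hypothetical href value: a JS call with the download address embedded in it.
href = "javascript:download('dl_app.php?s=12345');"
partial_address = extract_embedded(r"'(dl_app[^']*)'", href)
```

The character class `[^']*` stops at the closing quote, so only the embedded address is captured, not the surrounding JS call.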

In fact, with the string functions built into xpath, more complex cases can be handled. For example, the pages of the software download site "Wandoujia" contain no direct download link, but capturing the packets exchanged while downloading an apk from "Wandoujia" with the network packet analyzer "wireshark" shows that its download links have the following form:

http://apps.wandoujia.com/apps/<package name>/download

The crawl rule below, shown in Fig. 9, therefore extracts the package name from the page and assembles the download address:

<crawl_rule from="topic" xpath="concat('http://apps.wandoujia.com/apps/', //a/@data-packagename, '/download')"/>
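The effect of the concat call can be mirrored in plain Python. The sketch assumes the link form stated above (package name followed by "/download"); the function name and the package name "com.example.app" are hypothetical.

```python
PREFIX = "http://apps.wandoujia.com/apps/"
SUFFIX = "/download"

def build_download_url(package_name: str) -> str:
    """Join prefix, package name and suffix, like the xpath concat() call."""
    return "".join([PREFIX, package_name, SUFFIX])

url = build_download_url("com.example.app")  # hypothetical package name
```

Doing the string assembly inside the xpath expression keeps the whole rule in the configuration file, so no site-specific code is needed in the crawler.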

Next, the re-crawl mechanism used in the system and method of the present invention is described in detail. Normally, addresses that have been crawled are recorded in the deduplication list. Before a newly parsed address enters the crawl queue (or download queue), it is compared with the addresses in the deduplication list, and duplicates are discarded to avoid crawling a page repeatedly. However, every normally operating software download site updates the content it offers for download, and the updated content usually appears on the first few pages of the application download list (for example, when sorted by update time). At the same time, this latest content is usually the more important content for users.

However, because of the deduplication list, an existing crawler that finds the address to be crawled already recorded in the deduplication list automatically gives up crawling the updated page. To handle such page (software) updates, a new re-crawl mechanism is needed to ensure that the first few pages of the application list (for example, when sorted by update time) can be re-crawled at a certain frequency.

The system has two re-crawl mechanisms, based respectively on the type of the page to be crawled and on the page number on which the page to be crawled is located.

(1) Re-crawl mechanism based on page type

For each page type, a validity period is defined for the corresponding entries in the deduplication list. In this example, because a details page usually corresponds one-to-one with a package to be downloaded, validity periods are normally defined only for the pages that are parsed before the details page ("topic" page). For example, this example defines validity periods for pages of type "home", "list_" and "list". In another embodiment, a validity period can also be set for "topic" pages. In yet another embodiment, the other page types can also be left without a validity period. The validity period of a page can be set as needed.

The re-crawl mechanism based on page type mainly comprises the following points:

● The deduplication lists are stored using the zset data type of the Redis database. When a crawler finishes crawling an address, it records not only the address in the deduplication list but also the current time (as a timestamp) as the corresponding z value in the zset.

● The concrete validity period is recorded in the crawl rule (crawl_rule) and the initial address (start_url). When an address to be crawled is parsed out by a crawl rule or taken from the initial address list, the validity period of that crawl rule or initial address is used as the validity period of the address. This validity period is then checked against the corresponding record in the deduplication list. If the deduplication list has no corresponding record, or the record has expired, the address is pushed into the crawl queue together with its validity period information.

● For content-rich websites, the time at which an address is parsed may be far from the time at which it is crawled. To avoid repeated crawls caused by a validity period expiring during this long gap, an address should be checked against the deduplication list again after it is read from the crawl queue and before it is crawled. This second check mainly guards against the following situation. Suppose the crawl queue holds 100 addresses to be crawled, P1, P2, ..., P100. In the first round, the pages pointed to by P1, P2, ..., P100 are crawled in order and parsed, yielding a new page to be crawled, for example P101. Let T₁ be the time spent crawling and parsing P1, P2, ..., P100 in this round. If T₁ is greater than the validity period in the deduplication list, the entries of P1, P2, ..., P100 in the deduplication list have expired, so instead of going on to crawl and parse P101, the system may start a new (second) round of crawling and parsing P1, P2, ..., P100. Let T₂ be the time spent in this round (T₂ differs from T₁ because of changes in system load, page content and so on). If

(validity period of the deduplication entry of P101) < T₂ < (validity period of the deduplication entries of P1～P100) < T₁

then P101, parsed out again, is added to the crawl queue once more (because its entry in the deduplication list has likewise expired, T₂ being greater than its validity period). The crawl queue then contains two identical addresses P101, most likely with the same priority. Moreover, because T₂ is less than the validity period of the deduplication entries of P1～P100, P1～P100 are not crawled again. If P101 is not checked against the deduplication list again before crawling, the two identical addresses will probably be crawled twice within a short time, increasing the load on the target website or causing the crawl to fail.
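The second check at dequeue time can be sketched as follows. This is an illustration under assumptions: a list of (address, validity period) pairs stands in for the crawl queue, a dict stands in for the deduplication list, the actual page fetch is omitted, and the name `crawl_next` is invented.

```python
from typing import Dict, List, Optional, Tuple

def crawl_next(queue: List[Tuple[str, float]], dedup: Dict[str, float],
               now: float) -> Optional[str]:
    """Dequeue one address and crawl it only if it is not a fresh duplicate."""
    url, expire = queue.pop(0)
    last_crawled = dedup.get(url)
    if last_crawled is not None and last_crawled + expire > now:
        return None            # crawled recently: drop the repeated entry
    dedup[url] = now           # record the crawl time as the entry's timestamp
    return url                 # a real crawler would fetch the page here

# Two identical entries for P101, as in the scenario described above.
queue = [("P101", 3000.0), ("P101", 3000.0)]
dedup: Dict[str, float] = {}
first_result = crawl_next(queue, dedup, now=10000.0)
second_result = crawl_next(queue, dedup, now=10010.0)
```

The first entry is crawled and stamped; the second, read ten seconds later, finds a fresh entry in the deduplication list and is discarded.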

Table 2 below describes the possible handling of repeated crawl (download) requests for a particular address (assuming that the validity period of the page in the deduplication list is 3000 seconds):

Table 2

As shown above, based on the validity period of the page, the page at the same address (download address) can be crawled (downloaded) again at intervals of no less than 3000 seconds, so that the updated page can be obtained.

As those skilled in the art will note, the validity period attribute in fact depends on the initial address (start_url) and the crawl rule (crawl_rule). For most websites, pages of the same type can share the same re-crawl interval. In this embodiment, therefore, to simplify the definition of the validity period, the page validity period (expire attribute) is usually specified by type, as in the following example:

(When the priority attribute is not used, expire="0" means the entry never expires.)

Accordingly, a validity period can be defined for a crawl rule or an initial address of a specific website, for example as follows:

As the example above shows, for the initial address of "Anji Market" the page validity period in the deduplication list is defined as 5000 seconds, and for the "Sort by update time" page parsed from, for example, the "Games" main list page it is defined as 4000 seconds. That is, the initial address can be crawled again no sooner than 5000 seconds after the previous crawl, and the first "Sort by update time" list page no sooner than 4000 seconds after the previous crawl.

In addition, in this embodiment, a crawl rule without an "expire" attribute takes the validity period of the page type specified by its "to" attribute, and an initial address without an "expire" attribute takes the validity period of the initial address type (default "home"). Other validity periods can of course be defined for such crawl rules and initial addresses; for example, a fixed default validity period can be specified for them.
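The fallback order can be sketched as below. The per-type values in `TYPE_EXPIRE` are hypothetical, as are the function names; rules and start addresses are represented as plain dicts of their xml attributes.

```python
from typing import Dict, Optional

# Hypothetical per-type validity periods in seconds.
TYPE_EXPIRE: Dict[str, int] = {"home": 6000, "list_": 6000, "list": 3000}

def rule_expire(rule: Dict[str, str], type_expire: Dict[str, int]) -> Optional[int]:
    """Own 'expire' first, otherwise the validity period of the 'to' type."""
    if "expire" in rule:
        return int(rule["expire"])
    return type_expire.get(rule.get("to", ""))

def start_url_expire(start: Dict[str, str], type_expire: Dict[str, int]) -> Optional[int]:
    """Own 'expire' first, otherwise that of the start type (default 'home')."""
    if "expire" in start:
        return int(start["expire"])
    return type_expire.get(start.get("type", "home"))
```

A return value of `None` means that neither the rule nor its target type defines a validity period, which is the case for a rule whose addresses are download links.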

(2) mechanism of heavily climbing based on number of pages

In some cases, comparatively active software download website is reused for other software by the details page address of undercarriage software sometimes when upgrading application, perhaps, for the renewal version of this software, therefore same information page address may be corresponding to different software or the different editions of same software.This just means will be for the details page address definition term of validity, in order to can obtain the software of renewal.Yet, for most of website, can not define the term of validity for details page.In the present embodiment, process in the following ways this situation.

For the website that needs page turning ability traversal applications download list, when resolving its network address to be downloaded, can crawl " next_page " attribute of rule and the access times of respective rule are known to current number of pages by above.The application of considering renewal generally there will be the former pages in list, therefore, as long as the details page parsed the list page from former pages is adopted to a larger term of validity, can process above-mentioned situation.Because the value of this term of validity is usually very large, most of page usually can be extruded former pages after during this period of time.Be not extruded the link of former pages for small part, because interval is very long, so its heavy expense of climbing also can be left in the basket substantially.

In one example, the validity period of detail pages ("topic") that have no validity period of their own and are parsed from list pages ("list") with a page number less than 5 can be set to 1000000 seconds (about 11.57 days). For most application download sites, a download address essentially never remains within the first 5 pages for more than 11 days.
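The page-number rule in this example can be sketched in Python as follows. This is only an illustration under stated assumptions: the function name, the constant names and the use of None for "no validity period" are hypothetical; only the threshold of 5 pages and the value of 1000000 seconds come from the text.

```python
FRONT_PAGE_LIMIT = 5          # list pages numbered below this count as "front" pages
FRONT_PAGE_TTL = 1_000_000    # seconds, roughly 11.57 days


def detail_page_ttl(page_number, type_ttl=None):
    """Return the validity period (seconds) for a detail page.

    type_ttl is the validity period already defined for the page type, if any.
    None means "no validity period" (hypothetical convention: never re-crawled).
    """
    if type_ttl is not None:
        # A validity period defined for the page type takes precedence.
        return type_ttl
    if page_number is not None and page_number < FRONT_PAGE_LIMIT:
        # Detail page parsed from one of the first few list pages.
        return FRONT_PAGE_TTL
    return None


print(detail_page_ttl(2))   # front page: 1000000
print(detail_page_ttl(9))   # later page: None
```

Choosing a single large value keeps the rule simple: links that stay on the front pages are re-crawled at most once every 11.57 days or so.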

Thus, with the above two re-crawl mechanisms, namely the "re-crawl mechanism based on page type" and the "re-crawl mechanism based on page number", the deduplication filter supports validity periods, and validity periods can be set separately for different page types and different page numbers.
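A deduplication filter that supports validity periods might look like the following Python sketch. It is an assumption-laden illustration rather than the patented implementation: the class and method names are hypothetical, an in-memory dict stands in for the Redis list, and the rule "a record older than its validity period no longer counts as a duplicate" is an interpretation of the text.

```python
import time


class ExpiringDedupList:
    """In-memory stand-in for a deduplication list whose records carry timestamps."""

    def __init__(self):
        self._seen = {}  # address -> timestamp of the last update

    def is_duplicate(self, address, ttl=None, now=None):
        """Return True if the address should be treated as already processed.

        ttl is the validity period in seconds; None means the record never expires.
        """
        now = time.time() if now is None else now
        stamp = self._seen.get(address)
        if stamp is None:
            return False          # never seen before
        if ttl is None:
            return True           # seen, and the record has no validity period
        return now - stamp < ttl  # seen, still within the validity period

    def update(self, address, now=None):
        """Record (or refresh) the address with the current timestamp."""
        self._seen[address] = time.time() if now is None else now
```

With this shape, the same list can serve page types with different validity periods, because the period is supplied at lookup time rather than stored with the record.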

Next, the crawl/download flow according to an embodiment of the present invention is described in detail with reference to Figure 10 and Figure 11. The crawl flow shown in Figure 10 and the download flow shown in Figure 11 can be executed by the crawlers and the downloaders of the system, respectively. In other embodiments, both flows can also be executed by the same node, which then serves as both crawler and downloader.

The main features of the crawl/download flow of the system are first described as a whole:

● All queues/lists used in the crawl/download process are shared through the Redis database, so that the controller, one or more crawlers and one or more downloaders can access the same set of configuration and crawl in a distributed manner.

● For each website to be downloaded, the Redis database holds a corresponding set of queues/lists (implemented as zsets). This set comprises, for example, an initial address list, a crawl queue, a download queue, a crawl deduplication list, a parse deduplication list, a download deduplication list and a file deduplication list. The crawl queue and the download queue use the priority as the z value of the zset data type, and all the other lists use a timestamp as the z value.

● The crawl process and the download process are separated, which allows the two flows to be scheduled, controlled and optimized in different ways.

● The priority of a download link is initialized according to the page number on which it appears. In addition, when a download link is about to be downloaded, its priority is updated exponentially and the link is immediately put back into the download queue. This mechanism lets software with a lower page number (that is, newer software) be downloaded with higher priority. It also leaves a certain interval between a failed download and the subsequent retry, which increases the probability that the retry succeeds, and it ensures that the download address is put back into the download queue before the download starts, so that the address cannot be lost for other reasons during the download (if the download succeeds, the address is filtered out by the deduplication list when it is read again).
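The per-website set of queues and lists named in the features above can be summarized in a small Python sketch. The key names and the "site:name" naming scheme are hypothetical; the text only states which queues/lists exist and which kind of z value each one uses.

```python
# Hypothetical names for the seven per-website zsets and the kind of
# z value each one carries, as listed in the description.
SCORE_KIND = {
    "start_urls": "timestamp",      # initial address list
    "crawl_queue": "priority",      # crawl queue
    "download_queue": "priority",   # download queue
    "crawl_dedup": "timestamp",     # crawl deduplication list
    "parse_dedup": "timestamp",     # parse deduplication list
    "download_dedup": "timestamp",  # download deduplication list
    "file_dedup": "timestamp",      # file deduplication list
}


def site_keys(site):
    """Build the (assumed) database key for each queue/list of one website."""
    return {name: f"{site}:{name}" for name in SCORE_KIND}


for name, key in site_keys("example-market").items():
    print(f"{key:32s} z value = {SCORE_KIND[name]}")
```

Keeping every structure in the same data type means the controller, crawlers and downloaders only need one access pattern to the shared database.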

Next, the crawl flow is described in detail with reference to method 200 shown in Figure 10. Method 200 starts at step S210. At step S210, the controller 110 (or any other node, for example the crawlers 130_1～130_N or the downloaders 140_1～140_M) reads all initial addresses (start_url) under the corresponding spider node from the configuration file of the website to be downloaded, and stores them in the initial address list 127_1 in the database 120. In this example, the number of websites to be downloaded is 1 (L=1). In other embodiments there can of course be multiple websites to be downloaded, in which case one or more corresponding initial addresses can be stored in the initial address list for each website.

At step S220, each crawler module 130_1～130_N (or any other node, for example the controller 110) reads the addresses in the initial address list 127_1 at a fixed or variable time interval and stores them in the crawl queue 121_1 (since L=1 in this example, there is only one crawl queue 121_1; for simplicity of description, the websites to be downloaded are not grouped, or are placed in a single group). The specific way in which the addresses in the initial address list 127_1 are stored in the crawl queue 121_1 may differ depending on the implementation. For example, a single pre-designated crawler (for example, 130_1) can store the addresses in the initial address list 127_1 into the crawl queue 121_1; multiple pre-designated crawlers (for example, 130_1 and 130_2) can do so, each being responsible for half of the addresses; or multiple crawlers 130_1～130_N can do so according to other criteria specified in advance or generated in real time. The invention is not limited to any of the above ways of storing.

In addition, in this example, the reading interval is controlled in a manner similar to the re-crawl mechanism (for example, a timestamp is recorded when the initial address list 127_1 is read). The reading interval can of course also be controlled in other ways.

At step S230, each crawler module 130_1～130_N reads a record from the crawl queue 121_1. The priority used when reading is the negative of the crawl depth, so an address with a greater crawl depth has a lower priority. The crawler modules 130_1～130_N therefore process addresses with a smaller crawl depth first. The priority setting of course depends entirely on the needs of the system implementation, and different values can be used for different requirements. In another embodiment, addresses with a greater crawl depth can be processed first.
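The read order of step S230 can be illustrated with a small in-memory priority queue in Python. This is a sketch only: the class name is hypothetical and the standard-library heapq module stands in for the Redis zset that the system actually uses.

```python
import heapq


class PriorityQueue:
    """Pops the entry with the highest priority first (ties: insertion order)."""

    def __init__(self):
        self._heap = []
        self._counter = 0  # tie-breaker so equal priorities keep insertion order

    def push(self, address, priority):
        # heapq is a min-heap, so the priority is negated to pop the largest first.
        heapq.heappush(self._heap, (-priority, self._counter, address))
        self._counter += 1

    def pop(self):
        neg_priority, _, address = heapq.heappop(self._heap)
        return address, -neg_priority

    def __len__(self):
        return len(self._heap)


queue = PriorityQueue()
for address, depth in [("/list/3/item", 3), ("/", 1), ("/list", 2)]:
    queue.push(address, priority=-depth)  # priority is the negative crawl depth

while queue:
    print(queue.pop())  # shallowest address comes out first
```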

At step S240, the crawler module 130_1～130_N deduplicates the record that was read using the crawl deduplication list 123_1. That is, if the address is found to be a repeated crawl, the address to be crawled is discarded directly, and method 200 returns to step S230 to process the subsequent records in the crawl queue 121_1. Otherwise, step S250 is executed.

At step S250, the crawler module 130_1～130_N downloads the page pointed to by the address from the Internet. If the download fails, the address to be crawled is discarded, and the method returns to step S230 to process the subsequent records in the crawl queue 121_1. Otherwise, step S260 is executed.

At step S260, the crawler module 130_1～130_N updates the crawl deduplication list 123_1 with the current crawl address (the page corresponding to which has been downloaded successfully), and sends the page corresponding to the address to a parser for parsing. The parser can be a parser that each crawler module 130_1～130_N has of its own, or a single parser in the controller 110 or another node. The parser can also be a third-party parser outside the distributed system 10.

At step S270, it is determined whether an address parsed out by the parser points to a webpage that needs to be crawled further or to an application program. If it points to a webpage, step S280 is executed; otherwise step S290 is executed.

At step S280, the address that needs to be crawled further is deduplicated again using the crawl deduplication list 123_1 and then stored in the crawl queue 121_1, and the method returns to step S230.

At step S290, the download address of the application program parsed out by the parser is deduplicated using the parse deduplication list 124_1. If the address is found to be a repeated parse, it is discarded directly, and method 200 returns to step S230 to process the subsequent records in the crawl queue 121_1. Otherwise, step S295 is executed.

At step S295, the parse deduplication list 124_1 is updated with the address, and the address is stored in the download queue 122_1. The priority of the corresponding entry when it is stored in the download queue 122_1 can be as follows. For a page that has page number information (that is, a page parsed using a page-turning rule), the priority is the negative page number minus 100. For a page that has no page number information (that is, a page parsed without using a page-turning rule), the priority is the negative crawl depth minus 200. The priority setting of course depends entirely on the needs of the system implementation, and different values can be used for different requirements. In another embodiment, addresses with a larger page number or a greater crawl depth can be processed first.
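The initial priority assigned at step S295 can be written as a short Python function. The function and parameter names are hypothetical, and the formulas follow one reading of the text: "negative page number minus 100" and "negative crawl depth minus 200".

```python
def initial_download_priority(page_number=None, crawl_depth=0):
    """Return the starting priority of a download link (larger = served earlier).

    page_number: page of the application list the link was found on, or None
                 when the link was not reached through a page-turning rule.
    crawl_depth: crawl depth of the page the link was parsed from.
    """
    if page_number is not None:
        return -page_number - 100
    return -crawl_depth - 200


print(initial_download_priority(page_number=1))   # -101
print(initial_download_priority(page_number=12))  # -112
print(initial_download_priority(crawl_depth=3))   # -203
```

Under this reading, every link with page number information outranks every link without it as long as page numbers stay below 100, and within each group links closer to the front are served first.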

At step S297, it is determined whether the crawl queue 121_1 is empty. If it is empty, method 200 ends; otherwise the method returns to step S230. The termination condition of method 200 is not limited to this: whether method 200 ends can also be determined by whether a predetermined crawl depth has been reached, or by a combination of the two.

With the crawl flow shown in Figure 10, the download links contained in the webpages of a specified website can be extracted in a distributed manner into the download queue 122_1. This processing is completely separate from, and does not interfere with, the subsequent download flow, which makes it easy to optimize, configure and process page crawling and application downloading differently.
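Steps S210 to S297 can be condensed into a single-process Python sketch. It is a simplification under stated assumptions: in-memory sets and a heap replace the shared database, the fetch and parse callables are hypothetical stand-ins for the page downloader and the parser, validity periods are ignored, and download links are given the priority "negative crawl depth minus 200" of the page they were found on.

```python
import heapq


def crawl(start_urls, fetch, parse):
    """Run the crawl flow once and return (download_queue, crawled_addresses).

    fetch(url) returns the page or None on failure.
    parse(page) yields (link, is_page) pairs; is_page is False for download links.
    """
    crawl_queue = []      # min-heap of (depth, seq, url): shallow addresses first
    crawl_seen = set()    # stands in for the crawl deduplication list
    parse_seen = set()    # stands in for the parse deduplication list
    download_queue = []   # (download address, initial priority)
    seq = 0
    for url in start_urls:                               # S210 / S220
        heapq.heappush(crawl_queue, (0, seq, url))
        seq += 1
    while crawl_queue:                                   # S297: stop when empty
        depth, _, url = heapq.heappop(crawl_queue)       # S230
        if url in crawl_seen:                            # S240: repeated crawl
            continue
        page = fetch(url)                                # S250
        if page is None:                                 # download failed
            continue
        crawl_seen.add(url)                              # S260
        for link, is_page in parse(page):                # S270
            if is_page:
                if link not in crawl_seen:               # S280
                    heapq.heappush(crawl_queue, (depth + 1, seq, link))
                    seq += 1
            elif link not in parse_seen:                 # S290
                parse_seen.add(link)                     # S295
                download_queue.append((link, -depth - 200))
    return download_queue, crawl_seen


# Toy website: each page is simply its list of (link, is_page) pairs.
pages = {
    "home": [("list", True), ("a.apk", False)],
    "list": [("home", True), ("a.apk", False), ("b.apk", False)],
}
print(crawl(["home"], pages.get, lambda page: page))
```

In the distributed system each crawler would run the body of this loop against the shared queues, so the sets and the heap here only show the order of the checks.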

Next, the download flow is described in detail with reference to method 300 shown in Figure 11. Method 300 starts at step S310. At step S310, the download module 140_1～140_M reads an address to be downloaded from the download queue 122_1 according to the priority set in step S295 of method 200.

At step S320, the address to be downloaded that was read is deduplicated using the download deduplication list 125_1. That is, if the address is found to be a repeated download, it is discarded directly, and method 300 returns to step S310 to process the subsequent records in the download queue 122_1. Otherwise, step S330 is executed.

At step S330, the priority of the download address is multiplied by 2. If the resulting priority is greater than -10000, the address is put back into the download queue 122_1; otherwise the address is considered impossible to download and is discarded. The main purpose of step S330 is to determine whether too many download attempts have already been made for the address. When too many attempts cause the priority to fall below -10000, the address is judged invalid and discarded, and method 300 returns to step S310 to process the subsequent records. Otherwise, step S340 is executed. The priority threshold need not be -10000; it can be any other predetermined value that meets the system requirements.
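The doubling rule of step S330 bounds the number of download attempts per address. The Python sketch below works this out; the function names are hypothetical, and the starting priorities -101 and -203 are merely example values of the form produced at step S295.

```python
PRIORITY_FLOOR = -10000  # threshold from the text; other values are possible


def next_priority(priority):
    """Double the (negative) priority; return None when the address is discarded."""
    doubled = priority * 2
    return doubled if doubled > PRIORITY_FLOOR else None


def attempts_allowed(initial_priority):
    """Count how many times an address passes step S330 before being discarded."""
    if initial_priority >= 0:
        raise ValueError("priorities are expected to be negative")
    attempts = 0
    priority = next_priority(initial_priority)
    while priority is not None:
        attempts += 1
        priority = next_priority(priority)
    return attempts


print(attempts_allowed(-101))  # 6: -202, -404, -808, -1616, -3232, -6464
print(attempts_allowed(-203))  # 5: -406, -812, -1624, -3248, -6496
```

Because each failed attempt halves the address's standing in the queue, retries are pushed progressively further behind fresh links, which is what creates the interval between a failure and the next attempt.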

At step S340, the download module 140_1～140_M downloads the file pointed to by the address from the Internet. If the download fails, the address to be downloaded is skipped, and the method returns to step S310 to process the subsequent records in the download queue 122_1. Otherwise, step S350 is executed.

At step S350, when the file has been downloaded successfully, the download deduplication list 125_1 is updated with the download address.

At step S360, various checks are performed on the downloaded file, for example whether it is a file of the target type and whether the file is damaged.

At step S370, a digital digest (for example, md5 or sha1) is calculated for the downloaded file, and the file deduplication list 126_1 is checked against this digest. If an entry matching the calculated digest is found in the file deduplication list 126_1, the file has already been downloaded from another address, so the file can be discarded and the method returns directly to step S310. Otherwise, step S380 is executed.

At step S380, the file deduplication list 126_1 is updated with the downloaded file, and the file is stored in the file system. Specifically, the file deduplication list 126_1 can be updated with the digital digest of the file or with another identifier or feature that distinguishes the file from other files.
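Steps S370 and S380 amount to content-based deduplication, which can be sketched in Python with the standard hashlib module. The function name is hypothetical, a set stands in for the file deduplication list, a dict stands in for the file system, and sha1 is chosen only because the text names it as one possible digest.

```python
import hashlib


def store_if_new(content, seen_digests, storage, name):
    """Store the downloaded bytes unless an identical file was stored before.

    Returns True when the file was stored and False when it was discarded.
    """
    digest = hashlib.sha1(content).hexdigest()  # S370: compute the digital digest
    if digest in seen_digests:                  # same file seen at another address
        return False
    seen_digests.add(digest)                    # S380: update the dedup list
    storage[name] = content                     # S380: store the file
    return True


seen, files = set(), {}
print(store_if_new(b"package-bytes", seen, files, "site-a/app.apk"))  # True
print(store_if_new(b"package-bytes", seen, files, "site-b/app.apk"))  # False
```

Deduplicating on the digest rather than on the address is what catches the same file being offered under several download addresses.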

At step S390, it is determined whether the download queue 122_1 is empty. If it is empty, method 300 ends; otherwise the method returns to step S310.

Thus, the download method shown in Figure 11 achieves distributed downloading of a specific type of data (in this example, application programs), while the deduplication process and the repeated-download process are controlled through the deduplication lists and their associated validity periods as described above.

Next, it is described how the system provides log data related to crawling/downloading without a separately built log system.

As mentioned above, all queues, lists, configuration and the like of the system are stored in the common (Redis) database 120, and the modules of the system (controller 110, crawlers 130_1～130_N, downloaders 140_1～140_M) all interact through the database 120. By collecting statistics on the state of each list/queue in the database 120, the overall state of the system can be roughly grasped.

For example, since every record in each list carries a timestamp (for example, the z value), the crawl/download activity in a particular time period can be obtained by counting the records of each deduplication list that fall within that period.
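Counting the records that fall within a time period can be sketched in Python as follows. The function name is hypothetical and a dict mapping each record to its timestamp stands in for the zset; on a Redis sorted set, a score-range count of this kind is what the ZCOUNT command returns.

```python
def count_in_window(records, start, end):
    """Count records whose timestamp lies in [start, end] (both bounds inclusive).

    records maps each member (for example, an address) to its timestamp z value.
    """
    return sum(1 for stamp in records.values() if start <= stamp <= end)


# Toy deduplication list with timestamps in seconds.
download_dedup = {"a.apk": 100, "b.apk": 250, "c.apk": 400}
print(count_in_window(download_dedup, 200, 500))  # 2
```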

In one embodiment, example statistics are as follows (taking the crawl/download data for the Anji (安极) market during the period 2013/03/12～2013/03/13 as an example):

Current crawl queue length = 1

Crawl deduplication list total length = 65344

Parse deduplication list total length = 60445

Download deduplication list total length = 60444

File deduplication list total length = 60286

Crawl deduplication list length in the specified time period = 831

Parse deduplication list length in the specified time period = 16

Download deduplication list length in the specified time period = 16

File deduplication list length in the specified time period = 16

Thus, the system statistics that can be obtained indirectly include at least the following:

Number of addresses to be crawled = current crawl queue length = 1

Number of addresses to be downloaded + number of discarded download addresses = parse deduplication list total length - download deduplication list total length = 1

Total number of pages crawled = crawl deduplication list total length = 65344

Total number of download links parsed = parse deduplication list total length = 60445

Total number of download links successfully downloaded = download deduplication list total length = 60444

Total number of files successfully crawled = file deduplication list total length = 60286

Number of download links parsed in the specified time period = parse deduplication list length in the specified time period = 16

Number of download links successfully downloaded in the specified time period = download deduplication list length in the specified time period = 16

Total number of files successfully crawled in the specified time period = file deduplication list length in the specified time period = 16

In addition, when the end of the specified time period is the current time:

Number of pages crawled by the crawlers in the specified time period = crawl deduplication list length in the specified time period = 831

Therefore, as described above, crawl/download statistics of the system can be obtained from the statistics of the queues/lists stored in the database 120, which avoids building a separate log system.
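The derivation of system statistics from list lengths can be reproduced with a few lines of Python. The function and key names are hypothetical; the input values are the total lengths reported in the example statistics of this embodiment.

```python
def derive_stats(crawl_queue_len, crawl_dedup_total, parse_dedup_total,
                 download_dedup_total, file_dedup_total):
    """Derive overall crawl/download statistics from queue and list lengths."""
    return {
        "addresses_to_crawl": crawl_queue_len,
        # Parsed links that were never recorded as downloaded are either
        # still waiting in the download queue or were discarded.
        "to_download_or_discarded": parse_dedup_total - download_dedup_total,
        "pages_crawled": crawl_dedup_total,
        "links_parsed": parse_dedup_total,
        "links_downloaded": download_dedup_total,
        "files_stored": file_dedup_total,
    }


stats = derive_stats(crawl_queue_len=1, crawl_dedup_total=65344,
                     parse_dedup_total=60445, download_dedup_total=60444,
                     file_dedup_total=60286)
for name, value in stats.items():
    print(f"{name}: {value}")
```

The difference of 158 between links downloaded (60444) and files stored (60286) would then correspond to downloads rejected by the file checks or by digest-based deduplication.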

The invention has thus been described in connection with preferred embodiments. It should be understood that those skilled in the art can make various other changes, substitutions and additions without departing from the spirit and scope of the present invention. Therefore, the scope of the present invention is not limited to the above specific embodiments, but is defined by the appended claims.

Claims

1. A distributed method for downloading website data, comprising the following steps:

a) reading one or more addresses in an initial address list, and storing the one or more addresses into at least one of a plurality of download queues maintained in a database, wherein the plurality of download queues comprise at least a first download queue and a second download queue, and each download queue corresponds to a different type of data;

b) downloading data of a plurality of types according to the plurality of download queues, respectively;

c) parsing a subsequent address to be downloaded from the data downloaded according to the first download queue;

d) determining which download queue the subsequent address belongs to;

e) storing the subsequent address into the corresponding download queue according to the result of the determination in step d); and

f) repeating steps b) to e) until a predetermined condition is met.

2. The method according to claim 1, wherein step b) comprises:

downloading data from the corresponding website according to the records in the corresponding download queue, at the download interval for each website that is centrally maintained in the database.

3. The method according to claim 1, wherein the first download queue is a page download queue and the second download queue is an application download queue.

4. The method according to claim 1, wherein, for each website to be downloaded, one or more crawl rules are predefined, the crawl rules being used uniformly by a plurality of download modules and being specified separately for the different page types of the website.

5. The method according to claim 4, wherein step d) comprises:

when the crawl rule includes a first attribute indicating the type of the subsequent address, determining which download queue the subsequent address belongs to according to the first attribute.

6. The method according to claim 5, wherein step d) further comprises:

when the crawl rule does not include a first attribute indicating the type of the subsequent address, determining that the subsequent address belongs to the second download queue.

7. The method according to claim 4, wherein step e) comprises:

further determining whether the subsequent address matches a regular expression defined by a second attribute in the crawl rule; if it matches, splicing the character strings resulting from matching the subsequent address against the regular expression into the address to be stored in the corresponding download queue; otherwise skipping the subsequent address.

8. The method according to claim 4, wherein step c) comprises:

parsing the subsequent address to be downloaded from the downloaded data according to a third attribute in the crawl rule, the third attribute describing the structural features of the character string of the subsequent address.

9. The method according to claim 8, wherein the third attribute is an xpath expression.

10. The method according to claim 4, wherein step b) comprises:

reading a record from the first download queue;

deduplicating the record using a first deduplication list maintained in the database; and

downloading corresponding data according to the record.

11. The method according to claim 10, wherein the step of deduplicating the record comprises:

when the record is determined to match a record in the first deduplication list, skipping the record and reading the next record from the first download queue.

12. The method according to claim 10, wherein, before step e), the method further comprises the following step:

updating the first deduplication list with the address in the record; and step e) comprises:

when the subsequent address belongs to the first download queue, deduplicating the subsequent address using the updated first deduplication list, and storing it in the first download queue; and

when the subsequent address belongs to the second download queue, deduplicating the subsequent address using a second deduplication list maintained in the database, updating the second deduplication list accordingly, and storing the subsequent address in the second download queue.

13. The method according to claim 12, wherein step b) comprises:

reading an address to be downloaded from the second download queue according to priority;

deduplicating the address to be downloaded using a third deduplication list maintained in the database;

downloading the data pointed to by the address to be downloaded; and

updating the third deduplication list.

14. The method according to claim 13, wherein step b) further comprises:

checking the downloaded data;

calculating a digital digest of the downloaded data;

deduplicating the downloaded data, based on the digital digest, using a fourth deduplication list maintained in the database;

updating the fourth deduplication list; and

storing the downloaded data in a file system.

15. The method according to claim 14, wherein every record in the first deduplication list, the second deduplication list and the third deduplication list includes a validity period, and when a deduplication list is used to deduplicate an address and a record matching the address exists in the deduplication list, the method comprises the following steps:

16. The method according to claim 15, wherein the crawl rule further comprises a fourth attribute, and when the value of the fourth attribute is "true", the page of the application list of the corresponding website on which the currently downloaded link is located is identified by recording the number of times the crawl rule comprising the fourth attribute has been used.

17. The method according to claim 16, wherein, when the first deduplication list, the second deduplication list or the third deduplication list is updated, the validity period of the corresponding record in the deduplication list is set larger if the currently downloaded link is on a page within a predetermined number of pages of the application list than if it is on a page after the predetermined number of pages of the application list.

18. The method according to claim 17, wherein the predetermined number of pages is less than or equal to 5.

19. The method according to any one of claims 1 to 18, wherein statistical information related to website data downloading is provided through the database.

20. The method according to claim 19, wherein the statistical information comprises one or more of the following: the current lengths of the plurality of download queues, the length of each deduplication list within a specified time period, or the total length of each deduplication list.

21. The method according to claim 1, wherein the database is a key-value database.

22. The method according to claim 21, wherein the database is a Redis database.

23. The method according to claim 1, wherein the predetermined condition is that all download queues are empty, that a predetermined crawl depth is reached, or a combination of the two.

24. A distributed system for downloading website data, comprising:

a controller, configured to read one or more addresses in an initial address list and store the one or more addresses into at least one of a plurality of download queues maintained in a database;

the database, configured to maintain the plurality of download queues, wherein the plurality of download queues comprise at least a first download queue and a second download queue, and each download queue corresponds to a different type of data; and

a plurality of download modules, configured to download data of a plurality of types according to the plurality of download queues, respectively, parse a subsequent address to be downloaded from the data downloaded according to the first download queue, determine which download queue the subsequent address belongs to, store the subsequent address into the corresponding download queue according to the result of the determination, and repeat the above process until a predetermined condition is met.

25. The distributed system according to claim 24, wherein the plurality of download modules are further configured to:

26. The distributed system according to claim 24, wherein the first download queue is a page download queue and the second download queue is an application download queue.

27. The distributed system according to claim 24, wherein, for each website to be downloaded, one or more crawl rules are predefined, the crawl rules being used uniformly by the plurality of download modules and being specified separately for the different page types of the website.

28. The distributed system according to claim 27, wherein the download module is configured to:

29. The distributed system according to claim 28, wherein the download module is further configured to:

30. The distributed system according to claim 27, wherein the download module is configured to:

31. The distributed system according to claim 27, wherein the download module is configured to:

32. The distributed system according to claim 31, wherein the third attribute is an xpath expression.

33. The distributed system according to claim 27, wherein the download module is configured to:

read a record from the first download queue;

download corresponding data according to the record.

34. The distributed system according to claim 33, wherein the download module is further configured to:

35. The distributed system according to claim 33, wherein the download module is configured to:

update the first deduplication list with the address in the record;

36. The distributed system according to claim 35, wherein the download module is further configured to:

update the third deduplication list.

37. The distributed system according to claim 36, wherein the download module is further configured to:

check the downloaded data;

calculate a digital digest of the downloaded data;

update the fourth deduplication list; and

store the downloaded data in a file system.

38. The distributed system according to claim 37, wherein every record in the first deduplication list, the second deduplication list and the third deduplication list includes a validity period, and when a deduplication list is used to deduplicate an address and a record matching the address exists in the deduplication list, the download module is further configured to:

39. The distributed system according to claim 38, wherein the crawl rule further comprises a fourth attribute, and when the value of the fourth attribute is "true", the page of the application list of the corresponding website on which the currently downloaded link is located is identified by recording the number of times the crawl rule comprising the fourth attribute has been used.

40. The distributed system according to claim 39, wherein, when the first deduplication list, the second deduplication list or the third deduplication list is updated, the download module sets the validity period of the corresponding record in the deduplication list larger if the currently downloaded link is on a page within a predetermined number of pages of the application list than if it is on a page after the predetermined number of pages of the application list.

41. The distributed system according to claim 40, wherein the predetermined number of pages is less than or equal to 5.

42. The distributed system according to any one of claims 24 to 41, wherein statistical information related to website data downloading is provided through the database.

43. The distributed system according to claim 42, wherein the statistical information comprises one or more of the following: the current lengths of the plurality of download queues, the length of each deduplication list within a specified time period, or the total length of each deduplication list.

44. The distributed system according to claim 24, wherein the database is a key-value database.

45. The distributed system according to claim 44, wherein the database is a Redis database.

46. The distributed system according to claim 24, wherein the predetermined condition is that all download queues are empty, that a predetermined crawl depth is reached, or a combination of the two.