CN103246675B

CN103246675B - Method and apparatus for capturing website data

Info

Publication number: CN103246675B
Application number: CN201210030588.6A
Authority: CN
Inventors: 江军; 余庆生
Original assignee: Beijing Baidu Netcom Science and Technology Co Ltd
Current assignee: Beijing Baidu Netcom Science and Technology Co Ltd
Priority date: 2012-02-10
Filing date: 2012-02-10
Publication date: 2018-01-12
Anticipated expiration: 2032-02-10
Also published as: CN103246675A

Abstract

The object of the invention is to provide a method and apparatus for capturing website data. First, according to the website topology information of the website, an unaccessed link is selected from all links in the current root page, and the next-layer page it points to is obtained. Then, according to a first predetermined rule, it is judged whether the next-layer page is a target information page. When the next-layer page is not a target information page, the next-layer page is taken as the current root page and steps a and b are repeated until a first predetermined condition is met. When the next-layer page is judged to be a target information page, the target information page is captured. When a second predetermined condition is met, the previous root page is taken as the current root page, and steps a, b, c1 and c2 are repeated. Compared with the prior art, the invention captures the target data of an entire website by depth-first traversal, which ensures the accuracy of target data capture and improves the efficiency of data capture.

Description

A method and apparatus for capturing website data

Technical field

The present invention relates to the field of Internet technology, and in particular to a technique for capturing website data.

Background art

In the prior art, capturing data from data-providing websites generally requires running a separate script for each website. When there are many data-providing websites, multiple sets of capture scripts must be maintained, so script maintenance cost is high and data capture efficiency is low. Moreover, after classification information has been set on a data-providing website, cookie information recording the most recently set classification information is stored on its server side. Because traditional data capture generally uses a breadth-first capture mode, and the Uniform Resource Locator (URL) of a page link does not change when the classification information is changed within the same page, the data captured after accessing each classification information link of the same page may be that of the last-selected classification recorded in the cookie information, rather than the target data corresponding to each classification that was meant to be captured. The accuracy of data capture is therefore low.

Therefore, how to capture website data effectively has become one of the problems that urgently need to be solved.

Summary of the invention

The object of the invention is to provide a method and apparatus for capturing website data.

According to one aspect of the invention, there is provided a computer-implemented method for capturing website data, the method comprising the following steps:

a. according to the website topology information of the website, selecting an unaccessed link from all links in the current root page, and obtaining the next-layer page it points to;

b. judging, according to a first predetermined rule, whether the next-layer page is a target information page;

c1. when the next-layer page is not a target information page, taking the next-layer page as the current root page and repeating steps a and b until a first predetermined condition is met;

c2. when the next-layer page is judged to be a target information page, capturing the target information page;

wherein the method further comprises:

- when a second predetermined condition is met, taking the previous root page as the current root page and repeating steps a, b, c1 and c2.

According to another aspect of the invention, there is also provided an apparatus for capturing website data, the apparatus comprising:

a first acquisition device, for selecting, according to the website topology information of the website, an unaccessed link from all links in the current root page, and obtaining the next-layer page it points to;

a judgment device, for judging, according to a first predetermined rule, whether the next-layer page is a target information page;

a first circulation device, for taking the next-layer page as the current root page when the next-layer page is judged not to be a target information page, and repeating the operations of the first acquisition device and the judgment device until a first predetermined condition is met;

wherein the apparatus further comprises:

a first grabbing device, for capturing the target information page when the next-layer page is judged to be a target information page;

wherein the apparatus further comprises:

a second circulation device, for taking the previous root page as the current root page when a second predetermined condition is met, and repeating the operations of the first acquisition device, the judgment device, the first circulation device and the first grabbing device.

Compared with the prior art, the invention uses the website topology information of the website whose data is to be captured, together with depth-first traversal, to capture the target data of the entire website, thereby reducing the maintenance cost of multi-script data capture, ensuring the accuracy of target data capture, and improving the efficiency of data capture.

Brief description of the drawings

Other features, objects and advantages of the invention will become more apparent from the detailed description of non-limiting embodiments given with reference to the following drawings:

Fig. 1 shows a schematic diagram of an apparatus for capturing website data according to one aspect of the invention;

Fig. 2 shows an example diagram of capturing website data according to a preferred embodiment of the invention;

Fig. 3 shows a schematic diagram of an apparatus for capturing website data according to another preferred embodiment of the invention;

Fig. 4 shows a schematic diagram of an apparatus for capturing website data according to yet another preferred embodiment of the invention;

Fig. 5 shows a schematic diagram of an apparatus for capturing website data according to a further embodiment of the invention;

Fig. 6 shows a schematic diagram of an apparatus for capturing website data according to a still further embodiment of the invention;

Fig. 7 shows a flowchart of a method for capturing website data according to another aspect of the invention;

Fig. 8 shows a flowchart of a method for capturing website data according to another preferred embodiment of the invention;

Fig. 9 shows a flowchart of a method for capturing website data according to yet another preferred embodiment of the invention;

Fig. 10 shows a flowchart of a method for capturing website data according to a further embodiment of the invention;

Fig. 11 shows a flowchart of a method for capturing website data according to a still further embodiment of the invention.

The same or similar reference numerals in the drawings denote the same or similar components.

Detailed description of the embodiments

The invention is described in further detail below with reference to the drawings.

Fig. 1 shows a schematic diagram of an apparatus for capturing website data according to one aspect of the invention. The capture apparatus 1 comprises a first acquisition device 111, a judgment device 112, a first circulation device 113, a first grabbing device 114 and a second circulation device 115.

Here, the capture apparatus 1 is a network device, which includes but is not limited to a computer, a network host, a single network server, a set of multiple network servers, or a cloud formed by multiple servers. Here, the cloud is formed by a large number of computers or network servers based on cloud computing, where cloud computing is a form of distributed computing: a super virtual computer composed of a group of loosely coupled computers.

Here, the capture apparatus 1 and the network device of the website may communicate by any communication mode, including but not limited to mobile communication based on 3GPP, LTE or WiMAX, computer network communication based on the TCP/IP or UDP protocols, and short-range wireless transmission based on the Bluetooth or infrared transmission standards.

The process by which the capture apparatus 1 captures the target information page is described in detail below with reference to Fig. 1:

First, the first acquisition device 111 selects, according to the website topology information of the website, an unaccessed link from all links in the current root page, and obtains the next-layer page it points to.

Here, the website topology information includes but is not limited to any of the following:

1) the Uniform Resource Locator (URL) of the first-layer page of the website whose data is to be captured;

2) the layer-count information of the website whose data is to be captured, i.e. the number of layers from the first-layer page (first layer) to the target information page (last layer);

3) the matching feature information of the links contained in each layer of pages of the website whose data is to be captured;

4) the page identification information of each layer of pages of the website whose data is to be captured, where the page identification information may be a custom tag, a comment, etc. located in the markup language file of the page.

Here, the markup language file includes but is not limited to:

a) HyperText Markup Language (HTML) files;

b) Extensible HyperText Markup Language (XHTML) files;

c) Extensible Markup Language (XML) files.
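The four kinds of topology information listed above could be held in one small record. The following is a minimal sketch only: the class name SiteTopology, its field names and the sample values are illustrative assumptions and do not come from the patent.

```python
from dataclasses import dataclass, field


@dataclass
class SiteTopology:
    """Hypothetical container for the website topology information."""
    first_page_url: str                      # 1) URL of the first-layer page
    layer_count: int                         # 2) layers from first page to target page
    link_patterns: dict[int, str] = field(default_factory=dict)  # 3) per-layer link pattern
    page_markers: dict[int, str] = field(default_factory=dict)   # 4) per-layer page identifier

    def is_last_layer(self, layer: int) -> bool:
        # Target information pages sit on the last layer.
        return layer >= self.layer_count


# Sample values are made up for illustration.
topology = SiteTopology(
    first_page_url="http://d.cn",
    layer_count=4,
    page_markers={4: "TYPE 3"},
)
```

Keeping the per-layer patterns and markers keyed by layer number lets one capture program serve many sites by swapping the record rather than the script.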

Specifically, when the first acquisition device 111 accesses the website whose data is to be captured for the first time, it first sends a first-layer page access request to the network device of the website (for example, a web server) according to the first-layer page URL given in the website topology information, using a predetermined communication mode such as the http or https protocol, and receives the first-layer page returned by the network device. The first acquisition device 111 then takes the first-layer page as the current page, extracts all links contained in it, and selects one unaccessed link from them; because the capture apparatus 1 is accessing the website for the first time, all links in the current page are unaccessed links. Next, according to the selected unaccessed link, the first acquisition device 111 sends to the web server, by the predetermined communication mode, a next-layer page access request for the page that the link points to, and receives the next-layer page returned by the network device. At the same time, it records the link in the access list as an accessed link, to mark that the next-layer page has been accessed.

In one example, the first acquisition device 111 first sends, according to the first-layer page URL in the website topology information of the website whose data is to be captured, for example http://d.cn, a page acquisition request for the page that the URL points to to the web server of the website, receives the first-layer page returned by the web server, and takes it as the current page A. It then parses the markup language file of the current page A, extracts all links a1 and a2 contained in it, and randomly selects link a1. Next, according to link a1, the first acquisition device 111 sends to the web server, by the predetermined communication mode, a next-layer page access request for the next-layer page B that link a1 points to, and receives the next-layer page B returned by the network device. At the same time, it records link a1, which points to page B, in the access list, to mark that the next-layer page B has been accessed.
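The link extraction and selection in this example can be sketched with the Python standard library. This is an illustration under stated assumptions: the helper names extract_links and select_unvisited are hypothetical, the page content is a made-up stand-in for page A, the HTTP request itself is omitted, and the first unaccessed link is taken where the example selects randomly.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkExtractor(HTMLParser):
    """Collects the href of every <a> tag, resolved against the page URL."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.links.append(urljoin(self.base_url, href))


def extract_links(page_url, html):
    parser = LinkExtractor(page_url)
    parser.feed(html)
    return parser.links


def select_unvisited(links, access_list):
    """Return the first link not in the access list and record it as accessed."""
    for link in links:
        if link not in access_list:
            access_list.add(link)
            return link
    return None


# Made-up markup standing in for first-layer page A with links a1 and a2.
page_a = '<a href="/b">a1</a> <a href="/g">a2</a>'
access_list = set()
links = extract_links("http://d.cn/", page_a)
chosen = select_unvisited(links, access_list)
```

Recording the link at selection time is what later allows the capture apparatus to tell accessed links from unaccessed ones when it returns to the same page.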

Those skilled in the art will understand that the above manner of obtaining the next-layer page is only an example; other existing or future manners of obtaining the next-layer page, if applicable to the invention, should also fall within the scope of protection of the invention and are incorporated herein by reference.

Next, the judgment device 112 judges, according to the first predetermined rule, whether the next-layer page obtained by the first acquisition device 111 is a target information page.

Here, the manner of judging according to the first predetermined rule includes but is not limited to:

- comparing the page identification information in the markup language file of the next-layer page with the predetermined page type of the target information page, to judge whether the next-layer page is a target information page.

In one example, the judgment device 112 extracts the markup language file of the obtained next-layer page, for example an HTML file, parses it, and reads the comment <!--TYPE 3--> from a predetermined position in the HTML file. Because this comment matches the page type of the target information page predetermined in the website topology information, the next-layer page is judged to be a target information page.
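The comment-based check could look roughly like the sketch below. It is an assumption-laden illustration: the function name is_target_page is hypothetical, and the sketch scans every HTML comment rather than reading one predetermined position as the example describes.

```python
import re

# Matches an HTML comment and captures its body without surrounding whitespace.
COMMENT_RE = re.compile(r"<!--\s*(.*?)\s*-->", re.DOTALL)


def is_target_page(html, target_type):
    """True if any HTML comment in the page equals the predetermined page type."""
    return any(body == target_type for body in COMMENT_RE.findall(html))


sample = "<html><!--TYPE 3--><body>details</body></html>"
result = is_target_page(sample, "TYPE 3")
```

An exact comparison is used so that a page marked with a different type value is not mistaken for a target information page.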

Those skilled in the art will understand that the above manner of judging the target information page is only an example; other existing or future manners of judging the target information page, if applicable to the invention, should also fall within the scope of protection of the invention and are incorporated herein by reference.

Then, when the judgment device 112 judges that the next-layer page is a target information page, the first grabbing device 114 captures the target information page.

Here, the manner of capturing includes but is not limited to any of the following:

1) capturing the markup language file of the target information page and all associated script files, such as CSS and JavaScript files;

2) capturing the text information, picture information and download links in the target information page.

In one example, when the next-layer page is a target information page, the first grabbing device 114 parses the HTML file of the target information page and all associated script files, extracts the text information, picture information and download link information in the target information page, and stores this information in the data repository of the capture apparatus 1. Here, the data repository includes but is not limited to a relational database, a key-value storage system, or a file system.
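A minimal sketch of the extraction step follows, assuming the page is plain HTML and that text, image sources and link targets are all that is wanted. The names TargetPageParser and capture_page are hypothetical, and storage in the data repository is left out.

```python
from html.parser import HTMLParser


class TargetPageParser(HTMLParser):
    """Collects visible text, image sources and link targets from one page."""

    def __init__(self):
        super().__init__()
        self.text, self.images, self.links = [], [], []
        self._skip = 0  # nesting depth inside <script>/<style>

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag in ("script", "style"):
            self._skip += 1
        elif tag == "img" and attrs.get("src"):
            self.images.append(attrs["src"])
        elif tag == "a" and attrs.get("href"):
            self.links.append(attrs["href"])

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip:
            self._skip -= 1

    def handle_data(self, data):
        # Ignore script/style bodies and whitespace-only text nodes.
        if not self._skip and data.strip():
            self.text.append(data.strip())


def capture_page(html):
    parser = TargetPageParser()
    parser.feed(html)
    parser.close()
    return {"text": parser.text, "images": parser.images, "links": parser.links}


page = ('<html><body><h1>Game</h1><img src="icon.png">'
        '<a href="game.jar">Download</a><script>var x=1;</script></body></html>')
captured = capture_page(page)
```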

Those skilled in the art will understand that the above manner of capturing the target information page is only an example; other existing or future manners of capturing the target information page, if applicable to the invention, should also fall within the scope of protection of the invention and are incorporated herein by reference.

Meanwhile, when the judgment device 112 judges that the next-layer page is not a target information page, the first circulation device 113 takes the next-layer page as the current root page and repeats the operations of the first acquisition device 111 and the judgment device 112 until the first predetermined condition is met.

Here, the first predetermined condition includes:

1) the current root page has no next-layer page;

2) the number of times the operations of the first acquisition device 111 and the judgment device 112 have been repeated exceeds a predetermined number.

Specifically, when the next-layer page is judged not to be a target information page, the first circulation device 113 takes the next-layer page as the current root page and repeats the operations of the first acquisition device 111 and the judgment device 112. That is, first, according to the website topology information of the website, an unaccessed link is selected from all links in the current root page, and the next-layer page it points to is obtained; then, according to the first predetermined rule, it is judged whether that next-layer page is a target information page. When the next-layer page is judged to be a target information page, the target information page is captured; when it is not, it is taken as the current root page and the first circulation device 113 repeats the above operations until the first predetermined condition is met.

In one example, as shown in Fig. 2, the first-layer page of the website is A, which contains unaccessed links a1 and a2. Unaccessed link a1 is randomly selected, the next-layer page B that link a1 points to is obtained, and page B is judged not to be a target information page. The first circulation device 113 then takes page B as the current root page and randomly selects, from all unaccessed links b1 and b2 in page B, unaccessed link b1 as the link for the next-layer access; it obtains the next-layer page C that link b1 points to and records link b1 in the access list, to mark that page C has been accessed. Then, according to the first predetermined rule, it is judged whether next-layer page C is a target information page. When C is judged not to be a target information page, C is taken as the current root page, c1 is selected from all unaccessed links c1 and c2 in C as the link for the next-layer access, the next-layer page D that c1 points to is obtained, and link c1 is recorded in the access list. Then, according to the first predetermined rule, it is judged whether next-layer page D is a target information page. If D is judged to be a target information page, D is captured. If D is judged not to be a target information page, page D is taken as the current root page; because page D is a last-layer page of the website, the current root page has no next-layer page, the first predetermined condition is met, and the first circulation device 113 stops repeating the above operations.
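The descent from A through B and C to D can be sketched as a bounded loop. This is a sketch under assumptions: SITE is a hypothetical in-memory stand-in for the website of Fig. 2 (the link from a2 to G, b2 to F and c2 to E follow the later part of the example), the function name descend is made up, and the first unaccessed link is taken where the example picks one at random.

```python
# Hypothetical in-memory stand-in for the site in Fig. 2: page -> {link: next page}.
SITE = {
    "A": {"a1": "B", "a2": "G"},
    "B": {"b1": "C", "b2": "F"},
    "C": {"c1": "D", "c2": "E"},
    "D": {}, "E": {}, "F": {}, "G": {},
}


def descend(root, is_target, access_list, max_steps=10):
    """Follow one unaccessed link per layer until a target page is found
    or the first predetermined condition is met."""
    path = [root]                            # chain of root pages, current root last
    for _ in range(max_steps):               # condition 2: bounded number of repetitions
        current = path[-1]
        unvisited = [l for l in SITE[current] if l not in access_list]
        if not unvisited:                    # condition 1: no next-layer page to open
            return path, None
        link = unvisited[0]                  # first unaccessed link (example picks randomly)
        access_list.add(link)
        next_page = SITE[current][link]
        if is_target(next_page):
            return path, next_page           # target found; caller would capture it
        path.append(next_page)               # next-layer page becomes the current root
    return path, None


visited = set()
path, target = descend("A", lambda page: False, visited)
```

The returned path is the chain of root pages, which is exactly what a later backtracking step needs in order to find the previous root page.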

Those skilled in the art will understand that the above manner of repeating the operations is only an example; other existing or future manners of repeating the operations, if applicable to the invention, should also fall within the scope of protection of the invention and are incorporated herein by reference.

When the second predetermined condition is met, the second circulation device 115 takes the previous root page as the current root page and repeats the operations of the first acquisition device 111, the judgment device 112, the first circulation device 113 and the first grabbing device 114.

Here, the second predetermined condition includes any of the following:

1) all links in the current root page have been accessed;

2) a predetermined number of links in the current root page have been accessed.

In one example, as shown in Fig. 2 connecting example, when the current root page is page D, due to current root page D without The next layer of page, that is, meet that whole links in the current root page in the second predetermined condition have accessed, then second circulation fills 115 are put using page D previous root page face C as the current root page, so that c1 and c2 are linked according to the whole of current root page C, Matching inquiry is carried out in access list, it is determined that and select do not access link c2 as next layer access link, with obtain C2 points to next layer of page E, then judges whether next layer of page E is the target information page, while remember in access list Page E link c2 is pointed in record；Subsequently, as whole link c1 and c2 in current root page C have been accessed, that is, meet second Predetermined condition, then second circulation device 115 is using page C previous root page face B as the current root page, and according to access list In the link of access that shows, all linked from current root page B in b1 and b2 and select not access link b2, as next layer The link of access, so as to according to link b2, point to next layer of page F to obtain link b2, while recorded in access list F link b2 is pointed to, next layer of page F has been accessed for identifying；Next, it is determined that whether next layer of page F is target information page Face；If judging, F for the target information page, captures F；If F is judged not for the target information page, and according to the Website topology Information understands that F is last layer, while understands that whole link b1 and b2 in current root page B have been visited according to access list Ask, that is, meet the second predetermined condition, then using page B previous root page face A as the current root page, show according in access list The link of access gone out, all linked from current root page A in a1 and a2 and select not 
access link a2, as next layer of access Link, so as to according to link a2, with obtain link a2 point to next layer of page G；Next, it is determined that whether G is target information page Face, when judging G for the target information page, then capture target information page G.
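The whole walk, including the return to the previous root page, amounts to a depth-first traversal with an explicit stack. The sketch below makes several assumptions: SITE is a hypothetical in-memory copy of the site in Fig. 2, the function name crawl is made up, links are taken in listed order instead of at random, and a page counts as a target when the supplied predicate says so.

```python
# Hypothetical in-memory stand-in for the site in Fig. 2: page -> {link: next page}.
SITE = {
    "A": {"a1": "B", "a2": "G"},
    "B": {"b1": "C", "b2": "F"},
    "C": {"c1": "D", "c2": "E"},
    "D": {}, "E": {}, "F": {}, "G": {},
}


def crawl(first_page, is_target):
    access_list = set()       # links already accessed
    captured = []             # target information pages, in capture order
    stack = [first_page]      # chain of root pages; stack[-1] is the current root
    while stack:
        current = stack[-1]
        unvisited = [l for l in SITE[current] if l not in access_list]
        if not unvisited:     # second predetermined condition: all links accessed
            stack.pop()       # the previous root page becomes the current root
            continue
        link = unvisited[0]
        access_list.add(link)
        page = SITE[current][link]
        if is_target(page):
            captured.append(page)   # stand-in for capturing the page
        else:
            stack.append(page)      # descend: next-layer page becomes current root
    return captured


# Assumption for the demo: pages without outgoing links are target information pages.
order = crawl("A", lambda page: not SITE[page])
```

Because each link is opened immediately after it is selected and before any sibling link, the page content fetched always corresponds to the link just followed, which is the property the background section says breadth-first capture lacks on cookie-driven sites.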

Fig. 3 shows a schematic diagram of an apparatus for capturing website data according to another preferred embodiment of the invention. The capture apparatus 1 further comprises a second grabbing device 316; the second grabbing device 316 determines and captures, according to a second predetermined rule, target download links from all links of the target information page captured by the first grabbing device 314.

Here, the functions of devices 311, 312, 313, 314 and 315 shown in Fig. 3 are the same as those of devices 111, 112, 113, 114 and 115 described above with reference to Fig. 1; for brevity, they are incorporated herein by reference and not repeated.

The process by which the capture apparatus 1 determines and captures target download links is described in detail below with reference to Fig. 3:

Here, the manner of determining and capturing target download links according to the second predetermined rule includes but is not limited to:

- determining and capturing target download links by keyword matching against the URLs of the links in the target information page.

In one example, when the target information page is H, the second grabbing device 316 extracts all links h1 and h2 in H and then performs string matching between the URLs of links h1 and h2 and the predetermined keyword ".jar". It determines that link h1 contains the keyword, so h1 is a target download link, and then captures the text information and URL of target download link h1.
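The keyword match in this example is a plain substring test. In the sketch below the function name find_target_downloads and the example.com URLs are illustrative assumptions; only the ".jar" keyword comes from the example.

```python
def find_target_downloads(links, keywords=(".jar",)):
    """links: iterable of (anchor_text, url) pairs.
    Returns the pairs whose URL contains at least one keyword."""
    return [(text, url) for text, url in links
            if any(keyword in url for keyword in keywords)]


# Made-up stand-ins for links h1 and h2 of target information page H.
links_in_h = [
    ("Download game", "http://example.com/files/game.jar"),
    ("Help", "http://example.com/help.html"),
]
targets = find_target_downloads(links_in_h)
```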

Those skilled in the art will understand that the above manner of determining target download links is only an example; other existing or future manners of determining target download links, if applicable to the invention, should also fall within the scope of protection of the invention and are incorporated herein by reference.

Fig. 4 shows a schematic diagram of an apparatus for capturing website data according to yet another preferred embodiment of the invention. The second grabbing device 416 comprises a link determining unit 4161 and a third grabbing unit 4162. The link determining unit 4161 determines, according to the second predetermined rule, candidate download links from all links contained in the target information page captured by the first grabbing device 414; the third grabbing unit 4162 determines and captures target download links from those candidate download links, according to the download data packages corresponding to the candidate download links determined by the link determining unit 4161.

Here, the functions of devices 411, 412, 413, 414 and 415 shown in Fig. 4 are the same as those of devices 311, 312, 313, 314 and 315 described above with reference to Fig. 3; for brevity, they are incorporated herein by reference and not repeated.

The process by which the capture apparatus 1 determines candidate download links and determines and captures target download links is described in detail below with reference to Fig. 4:

In one example, when the links contained in the target information page are h1, h2 and h3, the link determining unit 4161 performs string matching between the URLs of these three links and the predetermined keywords ".sis" and ".jar". It determines that the URL of h1 contains the keyword ".sis" and the URL of h2 contains the keyword ".jar", so h1 and h2 are candidate download links. The third grabbing unit 4162 then obtains the download data packages corresponding to candidate download links h1 and h2 and reads the file headers of the two packages to judge whether each is a binary data package. When the download data package corresponding to download link h1 is judged to be a binary data package, h1 is determined to be a target download link; its text information and URL are captured and stored in the data repository of the capture apparatus 1.
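The patent does not say how a file header is recognised as binary, so the check below is purely an assumed heuristic: a ZIP signature (JAR files are ZIP archives) or the presence of a NUL byte. The names looks_binary and confirm_downloads, the URLs and the fake header bytes are all hypothetical, and real downloading is replaced by a dictionary lookup.

```python
ZIP_MAGIC = b"PK\x03\x04"   # JAR files are ZIP archives and start with this signature


def looks_binary(header: bytes) -> bool:
    """Assumed heuristic, not the patent's rule: ZIP signature or any NUL byte."""
    return header.startswith(ZIP_MAGIC) or b"\x00" in header


def confirm_downloads(candidates, fetch_header):
    """candidates: list of URLs; fetch_header: callable returning the first
    bytes of the package behind a URL. Keeps URLs whose header looks binary."""
    return [url for url in candidates if looks_binary(fetch_header(url))]


# Stand-in for real HTTP responses: one candidate returns an HTML error page.
fake_headers = {
    "http://example.com/app.sis": b"<html>Not found</html>",
    "http://example.com/app.jar": b"PK\x03\x04\x14\x00",
}
confirmed = confirm_downloads(list(fake_headers), fake_headers.get)
```

Checking the header after the keyword match filters out links whose URL looks like a package but which actually return an HTML page.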

Those skilled in the art will understand that the above manner of determining candidate download links and/or of determining and capturing target download links is only an example; other existing or future manners of doing so, if applicable to the invention, should also fall within the scope of protection of the invention and are incorporated herein by reference.

Fig. 5 shows a schematic diagram of an apparatus for capturing website data according to a further embodiment of the invention. The first acquisition device 511 comprises a second acquisition unit 5111 and a third acquisition unit 5112. The second acquisition unit 5111 queries all links in the current root page according to the website topology information and a predetermined classification list, to obtain one or more matching links that match the classifications in the predetermined classification list; the third acquisition unit 5112 selects an unaccessed link from the one or more matching links obtained by the second acquisition unit 5111, and obtains the next-layer page it points to.

Here, the functions of devices 512, 513, 514 and 515 shown in Fig. 5 are the same as those of devices 112, 113, 114 and 115 described above with reference to Fig. 1; for brevity, they are incorporated herein by reference and not repeated.

The process by which the capture apparatus 1 obtains matching links and obtains the next-layer page is described in detail below with reference to Fig. 5:

Here, target information pages may belong to different classifications. For example, when the capture apparatus 1 captures target information pages and target download links for mobile terminals of a particular brand and model, the predetermined classification list gives the classification identifiers of the brand and model to be captured.

In one example, when the website whose data is to be captured is a website providing mobile terminal applications, the second acquisition unit 5111 first extracts all links a1, a2, a3 and a4 contained in current root page B and takes the matching feature information for link URLs contained in pages of the layer where the current root page is located, as given in the website topology information of the website:

http://c.d.cn/wml/eqp/index/From=.* $,

It matches this feature information against the URLs of all links a1, a2, a3 and a4 respectively, determines that the links matching the feature information are a1, a3 and a4, and extracts the anchor text of each: "Nokia N8 applications" for link a1, "Nokia E7 applications" for link a3, and "LG 6660 applications" for link a4. Then, according to the brand classification identifiers and corresponding model classification identifiers given in the predetermined classification list, "LG 6660" and "Nokia E7", the second acquisition unit 5111 performs string matching against the anchor text of links a1, a3 and a4 respectively, and obtains a3 and a4 as the matching links that match the brand and model classification identifiers. The third acquisition unit 5112 then selects unaccessed link a4 from matching links a3 and a4 according to the access list, sends to the web server of the website a next-layer page access request for the next-layer page that link a4 points to, and receives the next-layer page returned by the web server.
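The two-stage filter in this example (URL pattern first, then anchor text against the classification list) can be sketched as follows. The function name match_links, the concrete link URLs and the cleaned-up regular expression are illustrative assumptions; the anchor texts and the classification identifiers are taken from the example.

```python
import re


def match_links(links, url_pattern, categories):
    """links: list of (anchor_text, url) pairs. Keeps links whose URL matches
    the layer's pattern and whose anchor text contains a listed classification."""
    pattern = re.compile(url_pattern)
    return [
        (text, url) for text, url in links
        if pattern.search(url) and any(c in text for c in categories)
    ]


# URLs are made up; a2 deliberately fails the URL pattern.
links_in_b = [
    ("Nokia N8 applications", "http://c.d.cn/wml/eqp/index/From=n8"),
    ("About us", "http://c.d.cn/about"),
    ("Nokia E7 applications", "http://c.d.cn/wml/eqp/index/From=e7"),
    ("LG 6660 applications", "http://c.d.cn/wml/eqp/index/From=lg6660"),
]
matched = match_links(
    links_in_b,
    r"^http://c\.d\.cn/wml/eqp/index/From=.*$",   # assumed form of the feature pattern
    ["LG 6660", "Nokia E7"],
)
matched_texts = [text for text, _ in matched]
```

Restricting the descent to matching links keeps the traversal inside the brands and models of interest instead of walking the whole site.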

Here, anchor text means an anchor text link, i.e. a hypertext link, which establishes the relationship between a text keyword and a URL link.

Here, it should be noted that the matching feature information in the embodiment is given only as an illustrative example to aid understanding of the invention, and is not the matching feature information used in actual application. Unless otherwise stated, matching feature information appearing elsewhere serves the same purpose and, for brevity, is not described again.

Those skilled in the art will understand that the above manner of obtaining matching links and/or of obtaining the next-layer page is only an example; other existing or future manners of doing so, if applicable to the invention, should also fall within the scope of protection of the invention and are incorporated herein by reference.

Fig. 6 shows the equipment schematic diagram for being used to capture website data according to further embodiment of the present invention.Wherein, One capture apparatus 614 includes the placement unit 6142 of comparing unit 6141 and the 3rd.When judgment means 612 judge that the next layer of page is During the target information page, comparing unit 6141 is by it compared with having captured in page info；When the next layer of page is not deposited When in described captured in page info, it is the target information page that the 3rd placement unit 6142, which is captured,.

Here, the function of device 611 shown in Fig. 6,612,613 and 615 and the above device 111 described by reference picture 1, 112nd, 113 is identical with 115 content, for simplicity, it is incorporated herein by reference, without repeating.

Here, the captured page information includes, but is not limited to:

- identification information of the captured target information pages, such as the URL, identification ID or identification name of a target information page;

- characteristic information of the captured target information pages.

This information may be stored in a captured database, which includes, but is not limited to, a relational database, a key-value storage system or a file system.

In one example, when the next-layer page is judged to be a target information page, the comparing unit 6141 parses the markup language file of the target information page, for example by reading an HTML annotation (comment) from a predetermined position in the page's HTML file, where the annotation is the identification ID of the target information page. The comparing unit 6141 then compares this ID against the captured page information stored in the captured database. If the ID is not present in the captured page information, the third grabbing unit 6142 captures the page as the target information page and stores it in the data repository of capture apparatus 1, thereby achieving incremental data capture.

Preferably (referring to Fig. 6), capture apparatus 1 further includes an updating device (not shown), which saves or updates the captured page information according to the target information pages captured by the third grabbing unit 6142.

Specifically, the updating device writes the identification information of the target information page captured by the third grabbing unit 6142 into the captured database, so as to save or update the captured page information. If it detects that the captured database has not yet been established, it first initializes the captured database and then writes the identification information of the target information page into it.

In one example, the updating device inserts into the captured database a data record containing the identification ID of the target information page captured by the third grabbing unit 6142, so as to save or update the captured page information.
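The comparison and update steps above can be sketched as follows. This is a minimal illustration only, assuming the captured database is a SQLite table; the table name `captured`, the column `page_id` and the `capture` callback are hypothetical and do not come from the patent.

```python
import sqlite3


def open_captured_db(path=":memory:"):
    """Open the captured database, creating its table if it does not exist yet."""
    conn = sqlite3.connect(path)
    conn.execute("CREATE TABLE IF NOT EXISTS captured (page_id TEXT PRIMARY KEY)")
    return conn


def is_captured(conn, page_id):
    """Return True if a record with this identification ID is already stored."""
    row = conn.execute(
        "SELECT 1 FROM captured WHERE page_id = ?", (page_id,)
    ).fetchone()
    return row is not None


def capture_if_new(conn, page_id, capture):
    """Capture the page only when its ID is absent, then record the ID."""
    if is_captured(conn, page_id):
        return False
    capture(page_id)  # hypothetical callback that stores the page content
    conn.execute("INSERT INTO captured (page_id) VALUES (?)", (page_id,))
    conn.commit()
    return True


conn = open_captured_db()
stored = []
first = capture_if_new(conn, "page-42", stored.append)
second = capture_if_new(conn, "page-42", stored.append)
```

In this sketch the second call is skipped because the ID is already recorded, which is the behaviour that makes the capture incremental.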

Those skilled in the art will understand that the above manner of saving or updating the captured page information is merely an example; other existing or future manners of saving or updating the captured page information, if applicable to the present invention, shall also fall within its scope of protection and are incorporated herein by reference.

Fig. 7 shows a flowchart of a method for capturing website data according to another aspect of the present invention.

The process by which capture apparatus 1 captures target information pages is described in detail below with reference to Fig. 7:

First, in step S701, capture apparatus 1 selects, according to the website topology information, one unvisited link from all links in the current root page, and obtains the next-layer page to which it points.

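Here, the website topology information includes, but is not limited to:

1) the uniform resource locator (URL) of the first-layer page of the website whose data is to be captured;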

Here, the markup language file includes, but is not limited to:

a) a HyperText Markup Language (HTML) file;

b) an Extensible HyperText Markup Language (XHTML) file;

c) an Extensible Markup Language (XML) file.

Specifically, when capture apparatus 1 accesses the website to be captured for the first time, in step S701 it first sends a first-layer page access request to a network device of the website, such as a web server, using the first-layer page URL given in the website topology information and a predetermined communication protocol such as http or https, and receives the first-layer page returned by the network device. Capture apparatus 1 then takes the first-layer page as the current page, extracts all links contained in it, and selects one unvisited link from them; because capture apparatus 1 is accessing the website for the first time, all links in the current page are unvisited. Next, according to the selected unvisited link, capture apparatus 1 sends to the web server, by the predetermined communication protocol, an access request for the next-layer page to which the link points, and receives the next-layer page returned by the network device. At the same time, it records the link pointing to the next-layer page as a visited link in the access list, to mark the next-layer page as visited.

In one example, in step S701, capture apparatus 1 first sends, according to the first-layer page URL in the website topology information of the website to be captured, for example http://d.cn, a page acquisition request for the page to which the URL points to the web server of the website, receives the first-layer page returned by the web server, and takes it as current page A. It then parses the markup language file of current page A, extracts all links a1 and a2 contained in current page A, and randomly selects link a1. Next, according to link a1, capture apparatus 1 sends to the web server, by the predetermined communication protocol, an access request for the next-layer page B to which link a1 points, and receives the next-layer page B returned by the network device; at the same time, it records link a1 pointing to page B in the access list, to mark the next-layer page B as visited.
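The link extraction and selection in step S701 might look like the following sketch. It assumes the page is already available as an HTML string (no network request is made), and it picks the first unvisited link rather than a random one so that the result is reproducible; the names `LinkExtractor`, `extract_links` and `select_unvisited` are illustrative, not taken from the patent.

```python
from html.parser import HTMLParser


class LinkExtractor(HTMLParser):
    """Collect the href of every <a> element in a markup language file."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.links.append(href)


def extract_links(html):
    """Return all link targets contained in the page, in document order."""
    parser = LinkExtractor()
    parser.feed(html)
    return parser.links


def select_unvisited(links, access_list):
    """Return the first link not yet in the access list, or None."""
    for link in links:
        if link not in access_list:
            return link
    return None


# Hypothetical first-layer page A containing links a1 and a2.
page_a = '<html><body><a href="/b">a1</a><a href="/g">a2</a></body></html>'
access_list = set()
links = extract_links(page_a)
chosen = select_unvisited(links, access_list)
access_list.add(chosen)  # record the link as visited
next_choice = select_unvisited(links, access_list)
```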

Then, in step S702, capture apparatus 1 judges, according to the first predetermined rule, whether the next-layer page it obtained is a target information page.

In one example, in step S702, capture apparatus 1 extracts the markup language file of the obtained next-layer page, such as an HTML file, parses it, and reads the annotation <!--TYPE 3--> from a predetermined position in the HTML file. If this annotation is consistent with the page type of the target information page predetermined in the website topology information, the next-layer page is judged to be a target information page.
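One possible reading of this rule is sketched below. It assumes the annotation is an HTML comment of the form `<!--TYPE n-->` and, as a simplification, searches the whole file instead of a fixed predetermined position; the function names are hypothetical.

```python
import re

# Assumed format of the annotation: an HTML comment such as <!--TYPE 3-->.
TYPE_COMMENT = re.compile(r"<!--\s*TYPE\s+(\d+)\s*-->")


def page_type(html):
    """Return the page type stored in the first TYPE comment, or None."""
    match = TYPE_COMMENT.search(html)
    return int(match.group(1)) if match else None


def is_target_page(html, target_type):
    """Sketch of the first predetermined rule: page type equals the target type."""
    return page_type(html) == target_type


# Hypothetical pages: one tagged as type 3 (target), one as type 2.
detail_page = "<html><!--TYPE 3--><body>App detail</body></html>"
list_page = "<html><!--TYPE 2--><body>App list</body></html>"
```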

Then, when capture apparatus 1 judges that the next-layer page is a target information page, in step S704 capture apparatus 1 captures the target information page.

Here, the manner of capturing includes, but is not limited to, any of the following:

In one example, when the next-layer page is a target information page, in step S704 capture apparatus 1 parses the HTML file of the target information page and all associated script files, extracts the text information, picture information and download link information in the target information page, and stores this information in the data repository of capture apparatus 1. Here, the data repository includes, but is not limited to, a relational database, a key-value storage system or a file system.

Meanwhile when capture apparatus 1 judges that the next layer of page not for the target information page, then in step S703, is grabbed Taking equipment 1 repeats capture apparatus 1 in step S701 and step S702 using the next layer of page as the current root page Operation, until meet the first predetermined condition.

Here, the first predetermined condition includes:

1) the current root page has no next-layer page;

2) the number of repetitions of the operations in step S701 and step S702 exceeds a predetermined number.

Specifically, when the next-layer page is judged not to be a target information page, in step S703 capture apparatus 1 takes the next-layer page as the current root page and repeats its operations in step S701 and step S702. That is, it first selects, according to the website topology information of the website, one unvisited link from all links in the current root page and obtains the next-layer page to which it points; then it judges, according to the first predetermined rule, whether that next-layer page is a target information page. When the next-layer page is judged to be a target information page, the target information page is captured; when it is not, the next-layer page is taken as the current root page and capture apparatus 1 repeats the above operations until the first predetermined condition is met.

In one example, as shown in Fig. 2, the first-layer page of the website is A, which contains unvisited links a1 and a2. Link a1 is selected at random, the next-layer page B to which a1 points is obtained, and page B is judged not to be a target information page. Then, in step S703, capture apparatus 1 takes page B as the current root page and randomly selects, from all unvisited links b1 and b2 in page B, unvisited link b1 as the link for the next-layer access; it obtains the next-layer page C to which b1 points and records link b1 pointing to page C in the access list, to mark page C as visited. Next, according to the first predetermined rule, it judges whether the next-layer page C is a target information page. When C is judged not to be a target information page, C is taken as the current root page, c1 is selected from all unvisited links c1 and c2 in C as the link for the next-layer access, the next-layer page D to which c1 points is obtained, and link c1 pointing to page D is recorded in the access list. Then, according to the first predetermined rule, it judges whether the next-layer page D is a target information page. If D is judged to be a target information page, D is captured. If D is judged not to be a target information page, page D is taken as the current root page; since page D is a last-layer page of the website, the current root page has no next-layer page and the first predetermined condition is met, so capture apparatus 1 stops the above repetition.

When the second predetermined condition is met, in step S705 capture apparatus 1 takes the previous root page as the current root page and repeats its operations in step S701, step S702, step S703 and step S704.

Here, the second predetermined condition includes any of the following:

1) all links in the current root page have been visited;

2) a predetermined number of links in the current root page have been visited.

In one example, continuing the example of Fig. 2, when the current root page is page D, since current root page D has no next-layer page, the second predetermined condition that all links in the current root page have been visited is met. Then, in step S705, capture apparatus 1 takes page C, the previous root page of page D, as the current root page, performs a matching query in the access list for all links c1 and c2 of current root page C, and determines and selects unvisited link c2 as the link for the next-layer access. It obtains the next-layer page E to which c2 points, judges whether E is a target information page, and records link c2 pointing to page E in the access list. Subsequently, since all links c1 and c2 in current root page C have been visited, the second predetermined condition is met; capture apparatus 1 takes page B, the previous root page of page C, as the current root page and, according to the visited links shown in the access list, selects unvisited link b2 from all links b1 and b2 in current root page B as the link for the next-layer access. It obtains the next-layer page F to which b2 points and records link b2 pointing to F in the access list, to mark the next-layer page F as visited. Next, it judges whether F is a target information page. If so, F is captured. If F is judged not to be a target information page, the website topology information shows that F is a last-layer page, and the access list shows that all links b1 and b2 in current root page B have been visited, so the second predetermined condition is met. Page A, the previous root page of page B, is then taken as the current root page; according to the visited links shown in the access list, unvisited link a2 is selected from all links a1 and a2 in current root page A as the link for the next-layer access, and the next-layer page G to which a2 points is obtained. Next, it judges whether G is a target information page; when G is judged to be a target information page, target information page G is captured.
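The depth-first traversal with backtracking walked through in the Fig. 2 example can be sketched as below. This is a simplified model under stated assumptions: the site is an in-memory dictionary rather than a live website, each link is identified by the page it points to, the first unvisited link is chosen instead of a random one, the set of target pages (`D` and `G`) is assumed for illustration, and the `max_depth` parameter stands in for the "predetermined number" of repetitions.

```python
# Hypothetical site from the Fig. 2 example: page -> pages its links point to.
SITE = {
    "A": ["B", "G"],  # links a1, a2
    "B": ["C", "F"],  # links b1, b2
    "C": ["D", "E"],  # links c1, c2
    "D": [], "E": [], "F": [], "G": [],
}
TARGETS = {"D", "G"}  # assumed target information pages


def crawl(site, first_page, is_target, max_depth=10):
    access_list = set()   # links already followed, named by the page they reach
    captured = []         # captured target information pages, in order
    order = []            # order in which next-layer pages are obtained
    stack = [first_page]  # current root page on top, previous root pages below
    while stack:
        root = stack[-1]
        unvisited = [p for p in site[root] if p not in access_list]
        # No unvisited link left (or depth limit exceeded): go back to the
        # previous root page, as in step S705.
        if not unvisited or len(stack) > max_depth:
            stack.pop()
            continue
        page = unvisited[0]
        access_list.add(page)
        order.append(page)
        if is_target(page):
            captured.append(page)  # step S704: capture, do not descend
        else:
            stack.append(page)     # step S703: page becomes the current root
    return order, captured


order, captured = crawl(SITE, "A", lambda p: p in TARGETS)
```

Keeping the chain of root pages on an explicit stack is one straightforward way to return to the "previous root page"; a recursive formulation would behave the same way.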

Fig. 8 shows a flowchart of a method for capturing website data according to a preferred embodiment of the present invention. Here, the process further includes step S806, in which capture apparatus 1 determines and captures, according to the second predetermined rule, a target download link from all links in the target information page it has captured.

Here, the operations of capture apparatus 1 in step S801, step S802, step S803, step S804 and step S805 shown in Fig. 8 are the same as those of capture apparatus 1 in step S701, step S702, step S703, step S704 and step S705 described above with reference to Fig. 7; for brevity, they are incorporated herein by reference and not repeated.

The process by which capture apparatus 1 determines and captures the target download link is described in detail below with reference to Fig. 8:

In one example, when the target information page is H, in step S806 capture apparatus 1 extracts all links h1 and h2 in H and performs string matching between the URLs of links h1 and h2 and the predetermined keyword ".jar". Having determined that link h1 contains the keyword, it determines that h1 is the target download link and captures the text information and URL of target download link h1.
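The keyword matching in this example amounts to a substring test on each URL. A minimal sketch, in which the link texts and URLs are made up for illustration and the function name is hypothetical:

```python
def find_target_download_links(links, keywords=(".jar",)):
    """Return the (text, url) pairs whose URL contains a predetermined keyword."""
    return [(text, url) for text, url in links if any(k in url for k in keywords)]


# Hypothetical links h1 and h2 extracted from target information page H.
links_in_h = [
    ("Download game", "http://example.com/files/game.jar"),
    ("User reviews", "http://example.com/reviews?id=7"),
]
targets = find_target_download_links(links_in_h)
```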

Fig. 9 shows a flowchart of a method for capturing website data according to another preferred embodiment of the present invention. Here, the process includes step S9061 and step S9062. In step S9061, capture apparatus 1 determines, according to the second predetermined rule, to-be-determined download links from all links contained in the target information page it captured in step S904. In step S9062, capture apparatus 1 determines and captures the target download link from these to-be-determined download links, according to the download data packages corresponding to the to-be-determined download links determined in step S9061.

Here, the operations of capture apparatus 1 in step S901, step S902, step S903, step S904 and step S905 shown in Fig. 9 are the same as those of capture apparatus 1 in step S801, step S802, step S803, step S804 and step S805 described above with reference to Fig. 8; for brevity, they are incorporated herein by reference and not repeated.

The process by which capture apparatus 1 determines the to-be-determined download links and then determines and captures the target download link is described in detail below with reference to Fig. 9:

In one example, when the links contained in the target information page are h1, h2 and h3, in step S9061 capture apparatus 1 performs string matching between the URLs of these three links and the predetermined keywords ".sis" and ".jar". Having determined that the URL of h1 contains the keyword ".sis" and the URL of h2 contains the keyword ".jar", it determines h1 and h2 to be to-be-determined download links. Then, in step S9062, capture apparatus 1 obtains the download data packages corresponding to to-be-determined download links h1 and h2 and reads the header of each package to judge whether it is a binary data package. When the download data package corresponding to h1 is judged to be a binary data package, h1 is determined to be the target download link, and its text information and URL are captured and stored in the data repository of capture apparatus 1.
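The two-stage check can be sketched as follows. The patent does not say how a header is judged to be binary, so the NUL-byte heuristic in `looks_binary` is purely an assumption, as are the URLs and the leading bytes in `HEADS`, which stand in for real downloads.

```python
def candidate_links(urls, keywords=(".sis", ".jar")):
    """Step S9061 (sketch): keep URLs that contain a predetermined keyword."""
    return [u for u in urls if any(k in u for k in keywords)]


def looks_binary(head):
    """Assumed heuristic: a NUL byte in the leading bytes signals binary data."""
    return b"\x00" in head[:512]


def target_links(urls, fetch_head):
    """Step S9062 (sketch): keep candidates whose package starts as binary."""
    return [u for u in candidate_links(urls) if looks_binary(fetch_head(u))]


# Hypothetical leading bytes of each download, standing in for real fetches.
HEADS = {
    "http://example.com/h1.sis": b"\x00\x01hypothetical-binary-payload",
    "http://example.com/h2.jar": b"<html><body>Login required</body></html>",
}
urls = [
    "http://example.com/h1.sis",
    "http://example.com/h2.jar",
    "http://example.com/h3.html",
]
candidates = candidate_links(urls)
targets = target_links(urls, HEADS.__getitem__)
```

Checking the payload after the URL filter guards against links that look like downloads but actually return an HTML page, which appears to be the motivation for the second step.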

Fig. 10 shows a flowchart of a method for capturing website data according to a further embodiment of the present invention. Here, the process includes step S10011 and step S10012. In step S10011, capture apparatus 1 queries all links in the current root page according to the website topology information and a predetermined category list, to obtain one or more matching links that match a category in the predetermined category list. In step S10012, capture apparatus 1 selects one unvisited link from the one or more matching links obtained in step S10011, and obtains the next-layer page to which it points.

Here, the operations of capture apparatus 1 in step S1002, step S1003, step S1004 and step S1005 shown in Fig. 10 are the same as those of capture apparatus 1 in step S702, step S703, step S704 and step S705 described above with reference to Fig. 7; for brevity, they are incorporated herein by reference and not repeated.

The process by which capture apparatus 1 obtains the matching links and obtains the next-layer page is described in detail below with reference to Fig. 10:

In one example, when the website to be captured is a website providing mobile terminal applications, in step S10011 capture apparatus 1 first extracts all links a1, a2, a3 and a4 contained in current root page B and takes the matching characteristic information for link URLs in pages of the layer where the current root page is located, as given in the website topology information of the website:

http://c.d.cn/wml/eqp/index/From=.*$

It matches this against the URLs of links a1, a2, a3 and a4, determining that the links matching the matching characteristic information are a1, a3 and a4, and extracts the anchor text of each: "Nokia N8 applications" for link a1, "Nokia E7 applications" for link a3 and "LG 6660 applications" for link a4. Then, in step S10012, capture apparatus 1 performs string matching between the brand category identifiers and corresponding model category identifiers given in the predetermined category list, namely "LG 6660" and "Nokia E7", and the anchor text of links a1, a3 and a4, obtaining a3 and a4 as the matching links that match the brand and model category identifiers. Capture apparatus 1 then selects, according to the access list, unvisited link a4 from matching links a3 and a4, sends to the web server of the website an access request for the next-layer page to which link a4 points, and receives the next-layer page returned by the web server.
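The two filters in this example (URL pattern, then anchor text against the category list) can be sketched as follows. The URL pattern string is the one given above, treated as a regular expression; the concrete link URLs are hypothetical, as is the function name.

```python
import re

# Matching characteristic information for link URLs, used as a regular expression.
URL_PATTERN = re.compile(r"http://c.d.cn/wml/eqp/index/From=.*$")
CATEGORIES = ["LG 6660", "Nokia E7"]  # predetermined category list

# Hypothetical links in current root page B: name -> (URL, anchor text).
links = {
    "a1": ("http://c.d.cn/wml/eqp/index/From=n8", "Nokia N8 applications"),
    "a2": ("http://c.d.cn/wml/help", "Help"),
    "a3": ("http://c.d.cn/wml/eqp/index/From=e7", "Nokia E7 applications"),
    "a4": ("http://c.d.cn/wml/eqp/index/From=lg6660", "LG 6660 applications"),
}


def url_matches(links, pattern):
    """Keep the links whose URL matches the matching characteristic information."""
    return [name for name, (url, _) in links.items() if pattern.match(url)]


def matching_links(links, pattern, categories):
    """Of the URL matches, keep links whose anchor text names a listed category."""
    return [
        name
        for name in url_matches(links, pattern)
        if any(category in links[name][1] for category in categories)
    ]


by_url = url_matches(links, URL_PATTERN)
matched = matching_links(links, URL_PATTERN, CATEGORIES)
```

Restricting the traversal to links that match a category keeps the crawler on pages for the listed device models and skips unrelated branches of the site.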

Fig. 11 shows a flowchart of a method for capturing website data according to a further embodiment of the present invention. Here, the process further includes step S11041 and step S11042. When capture apparatus 1 judges in step S1102 that the next-layer page is a target information page, in step S11041 capture apparatus 1 compares it with the captured page information; when the next-layer page is not present in the captured page information, in step S11042 capture apparatus 1 captures it as the target information page.

Here, the operations of capture apparatus 1 in step S1101, step S1102, step S1103 and step S1105 shown in Fig. 11 are the same as those of capture apparatus 1 in step S701, step S702, step S703 and step S705 described above with reference to Fig. 7; for brevity, they are incorporated herein by reference and not repeated.

Here, the captured page information includes, but is not limited to:

- characteristic information of the captured target information pages.

In one example, when the next-layer page is judged to be a target information page, in step S11041 capture apparatus 1 parses the markup language file of the target information page, for example by reading an HTML annotation from a predetermined position in the page's HTML file, where the annotation is the identification ID of the target information page. It then compares this ID against the captured page information stored in the captured database. If the ID is not present in the captured page information, in step S11042 capture apparatus 1 captures the page as the target information page and stores it in the data repository of capture apparatus 1, thereby achieving incremental data capture.

Preferably (referring to Fig. 11), the process further includes step S1107 (not shown), in which capture apparatus 1 saves or updates the captured page information according to the target information page it captured in step S11042.

Specifically, in step S1107, capture apparatus 1 writes the identification information of the target information page it captured in step S11042 into the captured database, so as to save or update the captured page information. If it detects that the captured database has not yet been established, it first initializes the captured database and then writes the identification information of the target information page into it.

In one example, in step S1107, capture apparatus 1 inserts into the captured database a data record containing the identification ID of the target information page it captured, so as to save or update the captured page information.

It should be noted that the present invention may be implemented in software and/or a combination of software and hardware, for example using an application-specific integrated circuit (ASIC), a general-purpose computer or any other similar hardware device. In one embodiment, the software program of the present invention may be executed by a processor to implement the steps or functions described above. Likewise, the software program of the present invention (including related data structures) may be stored in a computer-readable recording medium, for example a RAM memory, a magnetic or optical drive, a floppy disk or a similar device. In addition, some steps or functions of the present invention may be implemented in hardware, for example as circuitry that cooperates with a processor to perform each step or function.

It is obvious to those skilled in the art that the invention is not limited to the details of the above exemplary embodiments, and that the invention can be realized in other specific forms without departing from its spirit or essential characteristics. Therefore, the embodiments should be regarded in every respect as exemplary and non-restrictive; the scope of the invention is defined by the appended claims rather than by the above description, and all changes falling within the meaning and range of equivalency of the claims are therefore intended to be embraced in the invention. Any reference sign in a claim should not be construed as limiting the claim concerned. Moreover, the word "comprising" does not exclude other units or steps, and the singular does not exclude the plural. A plurality of units or devices recited in a device claim may also be implemented by one unit or device through software or hardware. Words such as "first" and "second" are used to denote names and do not indicate any particular order.

Claims

1. A computer-implemented method for capturing website data, the method comprising the following steps:

a. selecting, according to website topology information, one unvisited link from all links in a current root page, and obtaining the next-layer page to which it points;

c1. when the next-layer page is not a target information page, taking the next-layer page as the current root page and repeating steps a and b until a first predetermined condition is met;

wherein the method further comprises:

- when a second predetermined condition is met, taking the previous root page as the current root page and repeating steps a, b, c1 and c2;

wherein step a comprises:

- querying all links in the current root page according to the website topology information and a predetermined category list, to obtain one or more matching links that match a category in the predetermined category list;

- selecting one unvisited link from the one or more matching links, and obtaining the next-layer page to which it points.

2. The method according to claim 1, wherein the method further comprises:

y. determining and capturing, according to a second predetermined rule, a target download link from all links in the target information page.

3. The method according to claim 2, wherein step y comprises:

- determining, according to the second predetermined rule, to-be-determined download links from all links contained in the target information page;

- determining and capturing the target download link from the to-be-determined download links according to the download data packages corresponding to the to-be-determined download links.

4. The method according to any one of claims 1 to 3, wherein step c2 comprises:

- when the next-layer page is judged to be a target information page, comparing it with captured page information;

- when the next-layer page is not present in the captured page information, capturing it as the target information page.

5. The method according to claim 4, wherein the method further comprises:

- saving or updating the captured page information according to the captured target information page.

6. An apparatus for capturing website data, the apparatus comprising:

a first acquisition device, configured to select, according to website topology information, one unvisited link from all links in a current root page, and to obtain the next-layer page to which it points;

a first loop device, configured to, when the next-layer page is judged not to be a target information page, take the next-layer page as the current root page and repeat the operations of the first acquisition device and the judgment device until a first predetermined condition is met;

wherein the apparatus further comprises:

a first grabbing device, configured to capture the target information page when the next-layer page is judged to be a target information page;

wherein the apparatus further comprises:

a second loop device, configured to, when a second predetermined condition is met, take the previous root page as the current root page and repeat the operations of the first acquisition device, the judgment device, the first loop device and the first grabbing device;

wherein the first acquisition device comprises:

a second acquisition unit, configured to query all links in the current root page according to the website topology information and a predetermined category list, to obtain one or more matching links that match a category in the predetermined category list;

a third acquisition unit, configured to select one unvisited link from the one or more matching links, and to obtain the next-layer page to which it points.

7. The apparatus according to claim 6, wherein the apparatus further comprises:

a second grabbing device, configured to determine and capture, according to a second predetermined rule, a target download link from all links in the target information page.

8. The apparatus according to claim 7, wherein the second grabbing device comprises:

a link determining unit, configured to determine, according to the second predetermined rule, to-be-determined download links from all links contained in the target information page;

a third grabbing unit, configured to determine and capture the target download link from the to-be-determined download links according to the download data packages corresponding to the to-be-determined download links.

9. The apparatus according to any one of claims 6 to 8, wherein the first grabbing device comprises:

a comparing unit, configured to compare the next-layer page with captured page information when the next-layer page is judged to be a target information page;

a third grabbing unit, configured to capture the next-layer page as the target information page when it is not present in the captured page information.

10. The apparatus according to claim 9, wherein the apparatus further comprises:

an updating device, configured to save or update the captured page information according to the captured target information page.