CN103246675B - Method and apparatus for capturing website data - Google Patents
- Publication number
- CN103246675B (application CN201210030588.6A)
- Authority
- CN
- China
- Prior art keywords
- page
- next layer
- link
- target information
- captured
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Active
Landscapes
- Information Transfer Between Computers (AREA)
- Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
Abstract
The object of the invention is to provide a method and apparatus for capturing website data. First, according to website topology information, one unvisited link is selected from all the links in the current root page, and the next-layer page it points to is obtained (step a). Then, according to a first predetermined rule, it is judged whether the next-layer page is a target information page (step b). When the next-layer page is not a target information page, the next-layer page is taken as the current root page and steps a and b are repeated until a first predetermined condition is met (step c1). When the next-layer page is judged to be a target information page, the target information page is captured (step c2). When a second predetermined condition is met, the previous root page is taken as the current root page, and steps a, b, c1 and c2 are repeated. Compared with the prior art, the invention captures the target data of an entire website by depth-first traversal, which ensures the accuracy of target data capture and improves the efficiency of data capture.
Description
Technical field
The present invention relates to the field of Internet technology, and in particular to a technique for capturing website data.
Background
In the prior art, capturing data from data-providing websites generally requires running a separate script for each website. When there are many data-providing websites, many sets of capture scripts must be maintained, so script maintenance cost is high and data capture is inefficient. Moreover, after a category has been selected on a data-providing website, cookie information recording the most recently selected category is kept on its server side. Traditional data capture generally uses a breadth-first mode, and when the category is changed within the same page, the URL (Uniform Resource Locator) of the page's links does not change. As a result, after each category link on the same page has been visited, the data actually captured may be that of the last category recorded in the cookie information rather than the target data corresponding to each category that was meant to be captured, so the accuracy of data capture is low.
Therefore, how to capture website data effectively has become one of the problems urgently to be solved.
Summary of the invention
The object of the invention is to provide a method and apparatus for capturing website data.
According to one aspect of the invention, a computer-implemented method for capturing website data is provided, the method comprising the following steps:
a. according to website topology information, selecting one unvisited link from all the links in the current root page, and obtaining the next-layer page it points to;
b. judging, according to a first predetermined rule, whether the next-layer page is a target information page;
c1. when the next-layer page is not a target information page, taking the next-layer page as the current root page and repeating steps a and b until a first predetermined condition is met;
c2. when the next-layer page is judged to be a target information page, capturing the target information page;
wherein the method further comprises:
- when a second predetermined condition is met, taking the previous root page as the current root page and repeating steps a, b, c1 and c2.
According to another aspect of the invention, an apparatus for capturing website data is also provided, the apparatus comprising:
a first acquisition device, configured to select, according to website topology information, one unvisited link from all the links in the current root page, and to obtain the next-layer page it points to;
a judgment device, configured to judge, according to a first predetermined rule, whether the next-layer page is a target information page;
a first circulation device, configured to, when the next-layer page is judged not to be a target information page, take the next-layer page as the current root page and repeat the operations of the first acquisition device and the judgment device until a first predetermined condition is met;
wherein the apparatus further comprises:
a first grabbing device, configured to capture the target information page when the next-layer page is judged to be a target information page;
wherein the apparatus further comprises:
a second circulation device, configured to, when a second predetermined condition is met, take the previous root page as the current root page and repeat the operations of the first acquisition device, the judgment device, the first circulation device and the first grabbing device.
Compared with the prior art, the invention uses the website topology information of the website whose data is to be captured, together with depth-first traversal, to capture the target data of the entire website, thereby reducing the maintenance cost of multi-script data capture, ensuring the accuracy of target data capture, and improving the efficiency of data capture.
Brief description of the drawings
Other features, objects and advantages of the invention will become more apparent on reading the following detailed description of non-limiting embodiments, made with reference to the accompanying drawings:
Fig. 1 shows a schematic diagram of an apparatus for capturing website data according to one aspect of the invention;
Fig. 2 shows an example of capturing website data according to a preferred embodiment of the invention;
Fig. 3 shows a schematic diagram of an apparatus for capturing website data according to another preferred embodiment of the invention;
Fig. 4 shows a schematic diagram of an apparatus for capturing website data according to another preferred embodiment of the invention;
Fig. 5 shows a schematic diagram of an apparatus for capturing website data according to a further embodiment of the invention;
Fig. 6 shows a schematic diagram of an apparatus for capturing website data according to a further embodiment of the invention;
Fig. 7 shows a flowchart of a method for capturing website data according to another aspect of the invention;
Fig. 8 shows a flowchart of a method for capturing website data according to another preferred embodiment of the invention;
Fig. 9 shows a flowchart of a method for capturing website data according to another preferred embodiment of the invention;
Fig. 10 shows a flowchart of a method for capturing website data according to a further embodiment of the invention;
Fig. 11 shows a flowchart of a method for capturing website data according to a further embodiment of the invention.
The same or similar reference numerals in the drawings denote the same or similar components.
Detailed description
The invention is described in further detail below with reference to the accompanying drawings.
Fig. 1 shows a schematic diagram of an apparatus for capturing website data according to one aspect of the invention. The capture apparatus 1 includes a first acquisition device 111, a judgment device 112, a first circulation device 113, a first grabbing device 114 and a second circulation device 115.
Here, the capture apparatus 1 is a network device, which includes but is not limited to a computer, a network host, a single network server, a set of multiple network servers, or a cloud formed of multiple servers. Here, the cloud consists of a large number of computers or network servers based on cloud computing, where cloud computing is a form of distributed computing: a super virtual computer composed of a group of loosely coupled computers.
Here, the capture apparatus 1 and the network device of the website may communicate by any communication mode, including but not limited to mobile communication based on 3GPP, LTE or WiMAX, computer network communication based on TCP/IP or UDP, and short-range wireless transmission based on Bluetooth or infrared transmission standards.
The process by which the capture apparatus 1 captures target information pages is described in detail below with reference to Fig. 1:
First, the first acquisition device 111 selects, according to the website topology information, one unvisited link from all the links in the current root page, and obtains the next-layer page it points to.
Here, the website topology information includes but is not limited to any of the following:
1) the URL (Uniform Resource Locator) of the first-layer page of the website whose data is to be captured;
2) the layer-count information of the website, i.e. the number of layers from the first-layer page (first layer) to the target information page (last layer);
3) the matching feature information of the links contained in each layer's pages of the website;
4) the page identification information of each layer's pages of the website, where the page identification information may be a custom tag, comment or the like located in the page's markup language file.
Here, the markup language file includes but is not limited to:
a) a Hypertext Markup Language (HTML) file;
b) an Extensible HyperText Markup Language (XHTML) file;
c) an Extensible Markup Language (XML) file.
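The four kinds of topology information listed above could be held in a small configuration record. The following is a minimal sketch only: the class name `WebsiteTopology`, its field names, the layer count of 4 and the example pattern are all hypothetical and are not taken from the patent.

```python
from dataclasses import dataclass, field


@dataclass
class WebsiteTopology:
    """Hypothetical container for the four kinds of topology information."""
    first_page_url: str                                 # 1) URL of the first-layer page
    layer_count: int                                    # 2) layers from first page to target page
    link_patterns: dict = field(default_factory=dict)   # 3) layer number -> link URL pattern
    page_ids: dict = field(default_factory=dict)        # 4) layer number -> page identifier


# Illustrative values only; a real site would supply its own.
topology = WebsiteTopology(
    first_page_url="http://d.cn",
    layer_count=4,
    link_patterns={2: r"http://c\.d\.cn/wml/eqp/index/.*"},
    page_ids={4: "TYPE 3"},
)
```

Keeping this information as data rather than as per-site code is one way to read the stated goal of avoiding a separate capture script for every website.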
Specifically, when the first acquisition device 111 accesses the website whose data is to be captured for the first time, it first sends, according to the first-layer page URL given in the website topology information and through a predetermined communication mode (for example a communication protocol such as HTTP or HTTPS), a first-layer page access request to the network device of the website, such as a web server, and receives the first-layer page returned by that network device. The first acquisition device 111 then takes the first-layer page as the current page, extracts all the links contained in it, and selects one unvisited link from them; because the capture apparatus 1 is accessing the website for the first time, all the links in the current page are unvisited. Next, according to the selected unvisited link, the first acquisition device 111 sends to the web server, through the predetermined communication mode, an access request for the next-layer page that the link points to, and receives the next-layer page returned by the network device. At the same time, it records the link pointing to the next-layer page as a visited link in a visited list, to mark that the next-layer page has been visited.
In one example, the first acquisition device 111 first sends, according to the first-layer page URL in the website topology information, such as http://d.cn, a page acquisition request for the page that the URL points to to the web server of the website, receives the first-layer page returned by the web server, and takes it as the current page A. It then parses the markup language file of the current page A, extracts from it all the links a1 and a2 contained in the current page A, and randomly selects link a1. Next, according to link a1, the first acquisition device 111 sends to the web server, through the predetermined communication mode, an access request for the next-layer page B that link a1 points to, and receives the next-layer page B returned by the network device. At the same time, it records link a1, which points to page B, in the visited list, to mark that the next-layer page B has been visited.
Those skilled in the art will understand that the above manner of obtaining the next-layer page is only an example; other existing or future manners of obtaining the next-layer page, if applicable to the invention, should also fall within the scope of protection of the invention and are incorporated herein by reference.
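The link-selection step described above (extract all links, pick one that is not yet in the visited list, record it) can be sketched with the Python standard library. This is an illustration under assumptions, not the patent's implementation: `LinkExtractor` and `pick_unvisited` are hypothetical names, the visited list is modelled as a set of URLs, and the HTTP request that fetches the page is left out.

```python
import random
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkExtractor(HTMLParser):
    """Collects the absolute URL of every <a href=...> in a page."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.links.append(urljoin(self.base_url, href))


def pick_unvisited(page_html, base_url, visited):
    """Return one randomly chosen unvisited link and mark it visited, or None."""
    parser = LinkExtractor(base_url)
    parser.feed(page_html)
    candidates = [url for url in parser.links if url not in visited]
    if not candidates:
        return None          # every link in this page has already been visited
    choice = random.choice(candidates)
    visited.add(choice)      # record the link in the visited list
    return choice
```

Calling `pick_unvisited` repeatedly on the same page yields each link once and then `None`, which is the signal a caller could use to go back to the previous root page.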
Next, the judgment device 112 judges, according to the first predetermined rule, whether the next-layer page obtained by the first acquisition device 111 is a target information page.
Here, the manner of judging according to the first predetermined rule includes but is not limited to:
- comparing the page identification information in the markup language file of the next-layer page with the page type of the predetermined target information page, to judge whether the next-layer page is a target information page.
In one example, the judgment device 112 extracts the markup language file of the obtained next-layer page, such as an HTML file, and parses it to read the comment <!--TYPE 3--> from a predetermined position in the HTML file. If this comment matches the page type of the target information page predetermined in the website topology information, the next-layer page is judged to be a target information page.
Those skilled in the art will understand that the above manner of judging the target information page is only an example; other existing or future manners of judging the target information page, if applicable to the invention, should also fall within the scope of protection of the invention and are incorporated herein by reference.
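One way to picture the judgment step is a check for an HTML comment such as `<!--TYPE 3-->` in the page source. The sketch below is an assumption-laden simplification: the function name `is_target_page` is hypothetical, and it accepts the comment anywhere in the page, whereas the text reads it from a predetermined position.

```python
import re

# Matches the body of any HTML comment, ignoring surrounding whitespace.
COMMENT_RE = re.compile(r"<!--\s*(.*?)\s*-->", re.DOTALL)


def is_target_page(page_html, target_type):
    """True if any HTML comment in the page equals the target page type."""
    return any(comment == target_type
               for comment in COMMENT_RE.findall(page_html))
```

For example, `is_target_page("<html><!--TYPE 3--></html>", "TYPE 3")` returns `True`, while a page carrying only `<!--TYPE 2-->` is rejected.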
Then, when the judgment device 112 judges that the next-layer page is a target information page, the first grabbing device 114 captures the target information page.
Here, the manner of capturing includes but is not limited to any of the following:
1) capturing the markup language file of the target information page and all associated style and script files, such as CSS and JavaScript files;
2) capturing the text information, picture information and download links in the target information page.
In one example, when the next-layer page is a target information page, the first grabbing device 114 parses the HTML file of the target information page and all associated script files, extracts the text information, picture information and download link information in the target information page, and stores this information in the data repository of the capture apparatus 1. Here, the data repository includes but is not limited to a relational database, a key-value storage system, a file system, and the like.
Those skilled in the art will understand that the above manner of capturing the target information page is only an example; other existing or future manners of capturing the target information page, if applicable to the invention, should also fall within the scope of protection of the invention and are incorporated herein by reference.
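The second capture mode (text, pictures and links of a target information page) could look roughly like the following. This is a sketch, assuming plain HTML input; `PageContentExtractor` and `capture_page` are hypothetical names, and writing the result to a data repository is left out.

```python
from html.parser import HTMLParser


class PageContentExtractor(HTMLParser):
    """Collects visible text, image sources and link targets from a page."""

    def __init__(self):
        super().__init__()
        self.text, self.images, self.links = [], [], []
        self._skip = 0  # nesting depth inside <script> or <style>

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag in ("script", "style"):
            self._skip += 1
        elif tag == "img" and attrs.get("src"):
            self.images.append(attrs["src"])
        elif tag == "a" and attrs.get("href"):
            self.links.append(attrs["href"])

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip:
            self._skip -= 1

    def handle_data(self, data):
        # Keep non-empty text that is not script or style source.
        if not self._skip and data.strip():
            self.text.append(data.strip())


def capture_page(page_html):
    """Return the text, picture and link information found in the page."""
    extractor = PageContentExtractor()
    extractor.feed(page_html)
    return {"text": extractor.text,
            "images": extractor.images,
            "links": extractor.links}
```

The returned dictionary is only a stand-in for whatever record a relational database, key-value store or file system would hold.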
Meanwhile when judgment means 112 judge the next layer of page not for the target information page, then first circulation device
113 using the next layer of page as the current root page, repeats the first acquisition device 111 and the operation of judgment means 112,
Until meet the first predetermined condition.
Here, first predetermined condition includes:
1) the current root page is without the next layer of page;
2) repeat the first acquisition device 111 and the number of the operation of judgment means 112 exceedes pre-determined number.
Specifically, when the next-layer page is judged not to be a target information page, the first circulation device 113 takes the next-layer page as the current root page and repeats the operations of the first acquisition device 111 and the judgment device 112. That is, first, according to the website topology information of the website, one unvisited link is selected from all the links in the current root page, and the next-layer page it points to is obtained. Then, according to the first predetermined rule, it is judged whether this next-layer page is a target information page. When it is, the target information page is captured; when it is not, it is taken as the current root page and the first circulation device 113 repeats the above operations until the first predetermined condition is met.
In one example, as shown in Fig. 2, the first-layer page of the website is A, which contains unvisited links a1 and a2. Link a1 is selected at random, the next-layer page B that it points to is obtained, and page B is judged not to be a target information page. The first circulation device 113 then takes page B as the current root page and randomly selects unvisited link b1 from all the unvisited links b1 and b2 in page B as the link for the next layer of access, obtains the next-layer page C that b1 points to, and records link b1, which points to page C, in the visited list to mark that page C has been visited. Then, according to the first predetermined rule, it judges whether the next-layer page C is a target information page. When C is judged not to be a target information page, C is taken as the current root page, c1 is selected from all the unvisited links c1 and c2 in C as the link for the next layer of access, the next-layer page D that c1 points to is obtained, and link c1, which points to page D, is recorded in the visited list. Then, according to the first predetermined rule, it is judged whether the next-layer page D is a target information page. If D is judged to be a target information page, D is captured. If D is judged not to be a target information page, page D is taken as the current root page; because page D is a last-layer page of the website, the current root page has no next-layer page, the first predetermined condition is met, and the first circulation device 113 stops repeating the above operations.
Those skilled in the art will understand that the above manner of repeating the operations is only an example; other existing or future manners of repeating the operations, if applicable to the invention, should also fall within the scope of protection of the invention and are incorporated herein by reference.
When the second predetermined condition is met, the second circulation device 115 takes the previous root page as the current root page and repeats the operations of the first acquisition device 111, the judgment device 112, the first circulation device 113 and the first grabbing device 114.
Here, the second predetermined condition includes any of the following:
1) all the links in the current root page have been visited;
2) a predetermined number of links in the current root page have been visited.
In one example, continuing the example shown in Fig. 2, when the current root page is page D, page D has no next-layer page, so the second predetermined condition that all the links in the current root page have been visited is met. The second circulation device 115 therefore takes page C, the previous root page of page D, as the current root page, looks up all the links c1 and c2 of the current root page C in the visited list, and determines and selects the unvisited link c2 as the link for the next layer of access. It obtains the next-layer page E that c2 points to, records link c2, which points to page E, in the visited list, and judges whether the next-layer page E is a target information page. Subsequently, because all the links c1 and c2 in the current root page C have been visited, the second predetermined condition is met, so the second circulation device 115 takes page B, the previous root page of page C, as the current root page and, according to the visited links shown in the visited list, selects the unvisited link b2 from all the links b1 and b2 in the current root page B as the link for the next layer of access. It obtains the next-layer page F that link b2 points to and records link b2, which points to F, in the visited list to mark that the next-layer page F has been visited. Next, it judges whether the next-layer page F is a target information page. If F is judged to be a target information page, F is captured. If F is judged not to be a target information page, the website topology information shows that F is in the last layer and the visited list shows that all the links b1 and b2 in the current root page B have been visited, so the second predetermined condition is met. Page A, the previous root page of page B, is then taken as the current root page, and, according to the visited links shown in the visited list, the unvisited link a2 is selected from all the links a1 and a2 in the current root page A as the link for the next layer of access, and the next-layer page G that link a2 points to is obtained. Next, it is judged whether G is a target information page; when G is judged to be a target information page, the target information page G is captured.
Those skilled in the art will understand that the above manner of repeating the operations is only an example; other existing or future manners of repeating the operations, if applicable to the invention, should also fall within the scope of protection of the invention and are incorporated herein by reference.
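The descend-and-backtrack behaviour walked through for Fig. 2 amounts to a depth-first traversal with an explicit stack of root pages. The sketch below runs on an in-memory graph rather than a live website. Its assumptions: the function name `crawl_depth_first` is hypothetical, pages D, E, F and G are all treated as target information pages, the first unvisited link is taken instead of a random one so that the result is reproducible, and the site is a tree (no link leads back to an earlier page).

```python
def crawl_depth_first(site, root, is_target):
    """Depth-first traversal with backtracking over an in-memory site graph.

    site maps a page name to the ordered list of pages its links point to.
    """
    visited_links = set()   # the visited list, stored as (page, linked page) pairs
    captured = []           # target information pages, in capture order
    stack = [root]          # chain of root pages; stack[-1] is the current root

    while stack:
        current = stack[-1]
        unvisited = [page for page in site.get(current, [])
                     if (current, page) not in visited_links]
        if not unvisited:
            # All links visited (or none exist): return to the previous root page.
            stack.pop()
            continue
        next_page = unvisited[0]
        visited_links.add((current, next_page))
        if is_target(next_page):
            captured.append(next_page)   # step c2: capture the target page
        else:
            stack.append(next_page)      # step c1: descend one layer
    return captured


# The site of Fig. 2: A links to B and G, B to C and F, C to D and E.
fig2 = {"A": ["B", "G"], "B": ["C", "F"], "C": ["D", "E"]}
targets = {"D", "E", "F", "G"}
print(crawl_depth_first(fig2, "A", lambda page: page in targets))
# ['D', 'E', 'F', 'G']
```

The capture order D, E, F, G matches the order in which the walkthrough reaches those pages. A real crawler would also need the repetition limit from the first predetermined condition, or some other guard, to cope with links that form cycles.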
Fig. 3 shows a schematic diagram of an apparatus for capturing website data according to another preferred embodiment of the invention. Here, the capture apparatus 1 further includes a second grabbing device 316, which, according to a second predetermined rule, determines and captures target download links from all the links of the target information page captured by the first grabbing device 314.
Here, the devices 311, 312, 313, 314 and 315 shown in Fig. 3 are identical in function to the devices 111, 112, 113, 114 and 115 described above with reference to Fig. 1; for brevity, that description is incorporated herein by reference and is not repeated.
The process by which the capture apparatus 1 determines and captures target download links is described in detail below with reference to Fig. 3:
Here, the manner of determining and capturing target download links according to the second predetermined rule includes but is not limited to:
- determining and capturing target download links by keyword matching on the URLs of the links in the target information page.
In one example, when the target information page is H, the second grabbing device 316 extracts all the links h1 and h2 in H, then performs string matching between the URLs of links h1 and h2 and the predetermined keyword ".jar". Having determined that the URL of link h1 contains the keyword, it determines that h1 is a target download link and captures the text information and URL of target download link h1.
Those skilled in the art will understand that the above manner of determining the target download link is only an example; other existing or future manners of determining the target download link, if applicable to the invention, should also fall within the scope of protection of the invention and are incorporated herein by reference.
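Keyword matching on link URLs, as in the ".jar" example, reduces to a substring test. A minimal sketch follows; the function name `find_target_downloads`, the (anchor text, URL) pair representation and the case-insensitive comparison are assumptions made for illustration.

```python
def find_target_downloads(links, keywords=(".jar",)):
    """Keep the (anchor_text, url) pairs whose URL contains any keyword."""
    return [(text, url) for text, url in links
            if any(keyword in url.lower() for keyword in keywords)]
```

Each surviving pair carries both the link text and the URL, which are the two items the text says are captured for a target download link.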
Fig. 4 shows a schematic diagram of an apparatus for capturing website data according to another preferred embodiment of the invention. Here, the second grabbing device 416 includes a link determination unit 4161 and a third grabbing unit 4162. The link determination unit 4161 determines, according to the second predetermined rule, candidate (to-be-determined) download links from all the links contained in the target information page captured by the first grabbing device 414. The third grabbing unit 4162 determines and captures target download links from these candidate download links according to the download packages corresponding to them.
Here, the devices 411, 412, 413, 414 and 415 shown in Fig. 4 are identical in function to the devices 311, 312, 313, 314 and 315 described above with reference to Fig. 3; for brevity, that description is incorporated herein by reference and is not repeated.
The process by which the capture apparatus 1 determines candidate (to-be-determined) download links and then determines and captures target download links is described in detail below with reference to Fig. 4:
In one example, when the target information page contains links h1, h2 and h3, the link determination unit 4161 performs string matching between the URLs of these three links and the predetermined keywords ".sis" and ".jar", and determines that the URL of h1 contains the keyword ".sis" and the URL of h3 contains the keyword ".jar"; that is, h1 and h3 are determined to be candidate (to-be-determined) download links. Then, the third grabbing unit 4162 obtains the download packages corresponding to the candidate download links h1 and h3 and reads the headers of the two download packages to judge whether they are binary packages. When the download package corresponding to download link h1 is judged to be a binary package, download link h1 is determined to be a target download link, and its text information and URL are captured and stored in the data repository of the capture apparatus 1.
Those skilled in the art will understand that the above manners of determining candidate download links and/or of determining and capturing target download links are only examples; other existing or future manners of doing so, if applicable to the invention, should also fall within the scope of protection of the invention and are incorporated herein by reference.
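The second stage, confirming a candidate link by inspecting the start of its download package, might be sketched as follows. The text does not say how a binary package is recognised, so the heuristic here is purely an assumption: a package is treated as binary if it starts with the ZIP signature (JAR files are ZIP archives) or contains a NUL byte in its first bytes. `looks_binary`, `confirm_downloads` and the `read_head` callback are hypothetical names.

```python
ZIP_MAGIC = b"PK\x03\x04"  # JAR files are ZIP archives and begin with these bytes


def looks_binary(head):
    """Heuristic (an assumption): ZIP signature, or a NUL byte in the first bytes."""
    return head.startswith(ZIP_MAGIC) or b"\x00" in head


def confirm_downloads(candidates, read_head):
    """Keep the candidate URLs whose download package looks binary.

    candidates: URLs that already passed keyword matching.
    read_head:  callable returning the first bytes of the package at a URL.
    """
    return [url for url in candidates if looks_binary(read_head(url))]
```

Passing `read_head` in as a parameter keeps the sketch independent of any particular HTTP client; a link that serves an HTML error page instead of a package would be filtered out at this stage.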
Fig. 5 shows a schematic diagram of an apparatus for capturing website data according to a further embodiment of the invention. Here, the first acquisition device 511 includes a second acquisition unit 5111 and a third acquisition unit 5112. The second acquisition unit 5111 queries all the links in the current root page according to the website topology information and a predetermined category list, to obtain one or more matching links that match the categories in the predetermined category list. The third acquisition unit 5112 selects one unvisited link from the one or more matching links obtained by the second acquisition unit 5111 and obtains the next-layer page it points to.
Here, the devices 512, 513, 514 and 515 shown in Fig. 5 are identical in function to the devices 112, 113, 114 and 115 described above with reference to Fig. 1; for brevity, that description is incorporated herein by reference and is not repeated.
The process by which the capture apparatus 1 obtains matching links and obtains the next-layer page is described in detail below with reference to Fig. 5:
Here, target information pages may belong to different categories. For example, when the capture apparatus 1 captures target information pages and target download links for mobile terminals of a particular brand and model, the predetermined category list shows the category identifiers of the brand and model to be captured.
In one example, when the website whose data is to be captured is a website providing mobile terminal applications, the second acquisition unit 5111 first extracts all the links a1, a2, a3 and a4 contained in the current root page B, and matches the URLs of these links against the matching feature information, given in the website topology information, for link URLs contained in pages of the layer where the current root page is located:
http://c.d.cn/wml/eqp/index/From=.* $
The links that match this feature information are determined to be a1, a3 and a4, and their anchor texts are extracted: "Nokia N8 applications" for a1, "Nokia E7 applications" for a3 and "LG 6660 applications" for a4. Then, the second acquisition unit 5111 performs string matching between the brand and model category identifiers shown in the predetermined category list, "LG 6660" and "Nokia E7", and the anchor texts of links a1, a3 and a4, and obtains a3 and a4 as the matching links for those category identifiers. The third acquisition unit 5112 then selects, according to the visited list, the unvisited link a4 from the matching links a3 and a4, sends to the web server of the website an access request for the next-layer page that link a4 points to, and receives the next-layer page returned by the web server.
Here, anchor text means the text of an anchor-text link, i.e. a hypertext link, which establishes the relationship between a text keyword and a URL link.
Here, it should be noted that the matching feature information in this embodiment is given only as an illustrative example to aid understanding of the invention, and is not the matching feature information used in actual applications. Unless otherwise stated, matching feature information appearing elsewhere serves the same purpose as here and, for brevity, is not explained again.
Those skilled in the art will understand that the above manners of obtaining matching links and/or of obtaining the next-layer page are only examples; other existing or future manners of doing so, if applicable to the invention, should also fall within the scope of protection of the invention and are incorporated herein by reference.
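The two-step filter in this embodiment (URL pattern first, then anchor text against the category list) can be sketched as below. The function name `matching_links` and the (anchor text, URL) pair representation are hypothetical, and the pattern in the usage example is a cleaned-up stand-in for illustration, not the site's real matching feature information.

```python
import re


def matching_links(links, url_pattern, categories):
    """Keep the (anchor_text, url) pairs that pass both filters.

    1. The URL must match the link pattern of the current layer.
    2. The anchor text must contain one of the category identifiers.
    """
    url_re = re.compile(url_pattern)
    return [(text, url) for text, url in links
            if url_re.match(url) and any(c in text for c in categories)]


# Illustrative links; the URLs are made up for this example.
page_links = [
    ("Nokia N8 applications", "http://c.d.cn/wml/eqp/index/From=1"),
    ("About us", "http://c.d.cn/about"),
    ("Nokia E7 applications", "http://c.d.cn/wml/eqp/index/From=2"),
    ("LG 6660 applications", "http://c.d.cn/wml/eqp/index/From=3"),
]
selected = matching_links(
    page_links,
    r"http://c\.d\.cn/wml/eqp/index/From=.*$",
    ["LG 6660", "Nokia E7"],
)
```

Here `selected` holds the "Nokia E7 applications" and "LG 6660 applications" links, from which an unvisited one would then be chosen.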
Fig. 6 shows the equipment schematic diagram for being used to capture website data according to further embodiment of the present invention.Wherein,
One capture apparatus 614 includes the placement unit 6142 of comparing unit 6141 and the 3rd.When judgment means 612 judge that the next layer of page is
During the target information page, comparing unit 6141 is by it compared with having captured in page info;When the next layer of page is not deposited
When in described captured in page info, it is the target information page that the 3rd placement unit 6142, which is captured,.
Here, the functions of devices 611, 612, 613 and 615 shown in Fig. 6 are the same as those of devices 111, 112, 113 and 115 described above with reference to Fig. 1; for brevity, they are incorporated herein by reference and not repeated.
Here, the captured page information includes but is not limited to:
- identification information of captured target information pages, such as the URL, ID or name of a target information page;
- feature information of captured target information pages.
The captured page information may be stored in a captured database, which includes but is not limited to a relational database, a key-value storage system or a file system.
In one example, when the next-layer page is judged to be a target information page, comparing unit 6141 parses the markup language file of the target information page, for example by reading HTML annotation information from a predetermined position in the page's HTML file, where the annotation information is the ID of the target information page. It then compares this ID against the captured page information stored in the captured database. If the ID is not present in the captured page information, the third capture unit 6142 captures the next-layer page, i.e. the target information page, and stores it in the data repository of capture apparatus 1, thereby achieving incremental data capture.
Preferably (with reference to Fig. 6), capture apparatus 1 also includes an updating device (not shown), which saves or updates the captured page information according to the target information pages captured by the third capture unit 6142.
Specifically, the updating device writes the identification information of the target information page captured by the third capture unit 6142 into the captured database, so as to save or update the captured page information. If it detects that the captured database has not yet been set up, it first initializes the captured database and then writes the identification information of the target information page into it.
In one example, the updating device inserts into the captured database a data record containing the ID of the target information page captured by the third capture unit 6142, so as to save or update the captured page information.
Those skilled in the art will understand that the above manner of saving or updating the captured page information is merely an example. Other existing or future manners of saving or updating the captured page information, if applicable to the present invention, also fall within its scope of protection and are incorporated herein by reference.
Fig. 7 shows a flowchart of a method for capturing website data according to another aspect of the present invention.
Here, capture apparatus 1 is a network device, which includes but is not limited to a computer, a network host, a single network server, a set of multiple network servers, or a cloud formed by multiple servers. Here, a cloud consists of a large number of computers or network servers based on cloud computing, where cloud computing is a form of distributed computing in which a set of loosely coupled computers forms one super virtual computer.
Here, capture apparatus 1 and the network device of the website can communicate by any communication mode, including but not limited to mobile communication based on 3GPP, LTE or WiMAX, computer network communication based on TCP/IP or UDP, and short-range wireless transmission based on Bluetooth or infrared transmission standards.
The process by which capture apparatus 1 captures target information pages is described in detail below with reference to Fig. 7:
First, in step S701, capture apparatus 1 selects, according to the website topology information, one unvisited link from all links in the current root page, and obtains the next-layer page to which it points.
Here, the website topology information includes but is not limited to any of the following:
1) the URL (Uniform Resource Locator) of the first-layer page of the website to be captured;
2) the layer count of the website to be captured, i.e. the number of layers from the first-layer page (first layer) to the target information page (last layer);
3) the matching feature information of the links contained in the pages of each layer of the website to be captured;
4) the page identification information of the pages of each layer of the website to be captured, where the page identification information may be located in a custom tag, a comment or the like in the page's markup language file.
Here, the markup language file includes but is not limited to:
a) a HyperText Markup Language (HTML) file;
b) an Extensible HyperText Markup Language (XHTML) file;
c) an Extensible Markup Language (XML) file.
Specifically, when capture apparatus 1 accesses the website to be captured for the first time, in step S701 it first sends a first-layer page access request to a network device of the website, such as a web server, according to the first-layer page URL given in the website topology information and over a predetermined communication protocol such as HTTP or HTTPS, and receives the first-layer page returned by the network device. Capture apparatus 1 then takes the first-layer page as the current page, extracts all links contained in it, and selects one unvisited link from them; because capture apparatus 1 is accessing the website for the first time, all links in the current page are unvisited. Next, according to the selected unvisited link, capture apparatus 1 sends to the web server, over the predetermined communication protocol, a next-layer page access request for the page to which the link points, and receives the next-layer page returned by the network device. At the same time, it records in the access list the link pointing to that next-layer page as a visited link, to mark the next-layer page as visited.
In one example, in step S701, capture apparatus 1 first sends to the web server of the website a page request for the page to which the first-layer page URL points, the URL (for example http://d.cn) being taken from the website topology information of the website to be captured; it receives the first-layer page returned by the web server and takes it as current page A. It then parses the markup language file of current page A, extracts all links a1 and a2 contained in it, and randomly selects link a1. Next, according to link a1, capture apparatus 1 sends to the web server, over the predetermined communication protocol, a next-layer page access request for next-layer page B to which link a1 points, and receives next-layer page B returned by the network device. At the same time, it records link a1, which points to page B, in the access list, to mark next-layer page B as visited.
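The link-selection part of step S701 can be sketched as follows. This is a minimal illustration under stated assumptions, not the patent's implementation: the names `LinkExtractor` and `pick_unvisited` are hypothetical, the access list is modeled as a Python set, the page is an in-memory string rather than a fetched response, and the first unvisited link is taken in document order (the example above selects randomly) so that the result is reproducible.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkExtractor(HTMLParser):
    """Collects the href of every <a> tag, in document order."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                # Resolve relative links against the page URL.
                self.links.append(urljoin(self.base_url, href))


def pick_unvisited(page_html, base_url, access_list):
    """Return the first link not in access_list and record it; None if all are visited."""
    parser = LinkExtractor(base_url)
    parser.feed(page_html)
    for link in parser.links:
        if link not in access_list:
            access_list.add(link)  # marks the page it points to as visited
            return link
    return None


# Hypothetical first-layer page A containing links a1 and a2.
page_a = '<html><body><a href="/a1">a1</a> <a href="/a2">a2</a></body></html>'
access_list = set()
first = pick_unvisited(page_a, "http://d.cn", access_list)
second = pick_unvisited(page_a, "http://d.cn", access_list)
third = pick_unvisited(page_a, "http://d.cn", access_list)
```

Calling the function repeatedly on the same page walks through its links one by one and returns `None` once every link is in the access list, which corresponds to the "all links visited" situation discussed later.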
Those skilled in the art will understand that the above manner of obtaining the next-layer page is merely an example. Other existing or future manners of obtaining the next-layer page, if applicable to the present invention, also fall within its scope of protection and are incorporated herein by reference.
Then, in step S702, capture apparatus 1 judges, according to the first predetermined rule, whether the next-layer page it obtained is a target information page.
Here, the manner of judging according to the first predetermined rule includes but is not limited to:
- comparing the page identification information in the markup language file of the next-layer page with the predetermined page type of the target information page, to judge whether the next-layer page is a target information page.
In one example, in step S702, capture apparatus 1 extracts the markup language file, such as an HTML file, of the next-layer page it obtained and parses it, reading the following annotation information from a predetermined position in the HTML file: <!--TYPE 3-->. If this annotation information is consistent with the page type of the target information page predetermined in the website topology information, the next-layer page is judged to be a target information page.
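The page-type check of step S702 can be illustrated with a short sketch. It assumes the annotation is a standard HTML comment of the form `<!--TYPE 3-->` and, for simplicity, searches the whole file rather than one predetermined position; the function names and the sample pages are hypothetical.

```python
import re

# Assumed annotation shape: an HTML comment such as <!--TYPE 3-->.
TYPE_COMMENT = re.compile(r"<!--\s*TYPE\s+(\d+)\s*-->")


def page_type(html):
    """Return the page type number found in the annotation, or None if absent."""
    match = TYPE_COMMENT.search(html)
    return int(match.group(1)) if match else None


def is_target_page(html, target_type):
    """First predetermined rule: the page type equals the predetermined target type."""
    return page_type(html) == target_type


# Hypothetical pages: a detail page (type 3) and a list page (type 2).
detail_html = "<html><!--TYPE 3--><body>app detail</body></html>"
list_html = "<html><!--TYPE 2--><body>app list</body></html>"
```

A page without any annotation yields `None` and is therefore never treated as a target information page in this sketch.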
Those skilled in the art will understand that the above manner of judging a target information page is merely an example. Other existing or future manners of judging a target information page, if applicable to the present invention, also fall within its scope of protection and are incorporated herein by reference.
Then, when capture apparatus 1 judges that the next-layer page is a target information page, in step S704 it captures the target information page.
Here, the manner of capturing includes but is not limited to any of the following:
1) capturing the markup language file of the target information page and all associated files, such as CSS and JavaScript files;
2) capturing the text information, picture information and download links in the target information page.
In one example, when the next-layer page is a target information page, in step S704 capture apparatus 1 parses the HTML file of the target information page and all associated script files, extracts the text information, picture information and download link information in the target information page, and stores this information in the data repository of capture apparatus 1. Here, the data repository includes but is not limited to a relational database, a key-value storage system or a file system.
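As a rough sketch of the second capture manner, the following code pulls the visible text, picture addresses and link addresses out of a target information page. The class name `PageContent`, the `capture` helper, the returned dictionary layout and the sample page are all assumptions made for illustration; storage in a data repository is left out.

```python
from html.parser import HTMLParser


class PageContent(HTMLParser):
    """Collects visible text, image sources and link targets from one page."""

    def __init__(self):
        super().__init__()
        self.text = []
        self.images = []
        self.links = []
        self._skip = 0  # nesting depth inside <script> or <style>

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag in ("script", "style"):
            self._skip += 1
        elif tag == "img" and attrs.get("src"):
            self.images.append(attrs["src"])
        elif tag == "a" and attrs.get("href"):
            self.links.append(attrs["href"])

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip:
            self._skip -= 1

    def handle_data(self, data):
        # Keep only non-empty text that is not script or style content.
        if not self._skip and data.strip():
            self.text.append(data.strip())


def capture(html):
    """Return the text, pictures and links of a page as a plain dictionary."""
    parser = PageContent()
    parser.feed(html)
    parser.close()
    return {"text": " ".join(parser.text),
            "images": parser.images,
            "links": parser.links}


# Hypothetical target information page.
sample = ('<html><head><style>p{color:red}</style></head><body>'
          '<p>Demo app</p><img src="/img/shot.png">'
          '<a href="/dl/demo.jar">Download</a></body></html>')
record = capture(sample)
```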
Those skilled in the art will understand that the above manner of capturing a target information page is merely an example. Other existing or future manners of capturing a target information page, if applicable to the present invention, also fall within its scope of protection and are incorporated herein by reference.
Meanwhile when capture apparatus 1 judges that the next layer of page not for the target information page, then in step S703, is grabbed
Taking equipment 1 repeats capture apparatus 1 in step S701 and step S702 using the next layer of page as the current root page
Operation, until meet the first predetermined condition.
Here, first predetermined condition includes:
1) the current root page is without the next layer of page;
2) number for repeating the operation in step S701 and step S702 exceedes pre-determined number.
Specifically, when judging the next layer of page not for the target information page, then in step S703, capture apparatus 1 should
The next layer of page repeats its operation in step S701 and step S702 as the current root page, i.e. first, according to
The Website topology information of website, link is not accessed from whole link selections one in the current root page, and obtain it and refer to
To the next layer of page;Then, according to the first pre-defined rule, judge whether the next layer of page is the target information page;When sentencing
When the disconnected next layer of page is the target information page, the target information page is captured;When the next layer of page is not believed for target
The page is ceased, then using the next layer of page as the current root page, and capture apparatus 1 repeats aforesaid operations, until satisfaction the
One predetermined condition.
In one example, as shown in Fig. 2, the first-layer page of the website is A, which contains unvisited links a1 and a2. Unvisited link a1 is randomly selected, next-layer page B to which link a1 points is obtained, and page B is judged not to be a target information page. Then, in step S703, capture apparatus 1 takes page B as the current root page and randomly selects, from all unvisited links b1 and b2 in page B, unvisited link b1 as the link for the next-layer access. According to link b1, it obtains next-layer page C to which link b1 points, and records link b1, which points to page C, in the access list to mark page C as visited. Next, according to the first predetermined rule, it judges whether next-layer page C is a target information page. When C is judged not to be a target information page, C is taken as the current root page, and c1 is selected from all unvisited links c1 and c2 in C as the link for the next-layer access, so as to obtain next-layer page D to which c1 points, while link c1, which points to page D, is recorded in the access list. Then, according to the first predetermined rule, it judges whether next-layer page D is a target information page. If D is judged to be a target information page, D is captured. If D is judged not to be a target information page, page D is taken as the current root page; since page D is the last-layer page of the website, the current root page has no next-layer page, the first predetermined condition is met, and capture apparatus 1 stops the above repetition.
Those skilled in the art will understand that the above manner of repeating the operations is merely an example. Other existing or future manners of repeating the operations, if applicable to the present invention, also fall within its scope of protection and are incorporated herein by reference.
When the second predetermined condition is met, in step S705 capture apparatus 1 takes the previous root page as the current root page and repeats its operations in step S701, step S702, step S703 and step S704.
Here, the second predetermined condition includes any of the following:
1) all links in the current root page have been visited;
2) a predetermined number of links in the current root page have been visited.
In one example, continuing the example shown in Fig. 2, the current root page is page D. Because current root page D has no next-layer page, the second predetermined condition is met (all links in the current root page have been visited). In step S705, capture apparatus 1 therefore takes page C, the previous root page of page D, as the current root page. It looks up all links c1 and c2 of current root page C in the access list, determines that c2 is unvisited and selects it as the link for the next-layer access, so as to obtain next-layer page E to which c2 points; it then judges whether next-layer page E is a target information page, and records link c2, which points to page E, in the access list. Subsequently, because all links c1 and c2 in current root page C have been visited, the second predetermined condition is met, so capture apparatus 1 takes page B, the previous root page of page C, as the current root page. According to the visited links shown in the access list, it selects unvisited link b2 from all links b1 and b2 in current root page B as the link for the next-layer access, obtains next-layer page F to which link b2 points, and records link b2, which points to F, in the access list to mark next-layer page F as visited. Next, it judges whether next-layer page F is a target information page. If F is judged to be a target information page, F is captured. If F is judged not to be a target information page, the website topology information shows that F is the last layer, and the access list shows that all links b1 and b2 in current root page B have been visited, so the second predetermined condition is met. Page A, the previous root page of page B, is then taken as the current root page; according to the visited links shown in the access list, unvisited link a2 is selected from all links a1 and a2 in current root page A as the link for the next-layer access, and next-layer page G to which link a2 points is obtained. Next, it judges whether G is a target information page; when G is judged to be a target information page, target information page G is captured.
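The descent and backtracking walked through above amount to a depth-first traversal with an explicit stack of root pages. The sketch below replays the Fig. 2 example on an in-memory site; the dictionary `SITE`, the function `crawl` and the `max_depth` parameter (standing in for the "predetermined number" of repetitions) are assumptions for illustration, and links are taken in listed order instead of randomly.

```python
def crawl(site, targets, first_page, max_depth=10):
    """Depth-first traversal; returns (pages fetched in order, target pages captured)."""
    access_list = set()   # links that have already been visited
    fetched = []          # next-layer pages in the order they were obtained
    captured = []         # target information pages that were captured
    roots = [first_page]  # stack of root pages; roots[-1] is the current root page
    while roots:
        root = roots[-1]
        unvisited = [(link, page) for link, page in site.get(root, [])
                     if link not in access_list]
        # No unvisited link left (or depth limit exceeded):
        # fall back to the previous root page.
        if not unvisited or len(roots) > max_depth:
            roots.pop()
            continue
        link, page = unvisited[0]
        access_list.add(link)          # record the link in the access list
        fetched.append(page)
        if page in targets:
            captured.append(page)      # target information page: capture it
        else:
            roots.append(page)         # otherwise it becomes the current root page
    return fetched, captured


# Site layout of the example: A -> B, G; B -> C, F; C -> D, E.
SITE = {
    "A": [("a1", "B"), ("a2", "G")],
    "B": [("b1", "C"), ("b2", "F")],
    "C": [("c1", "D"), ("c2", "E")],
}
fetched, captured = crawl(SITE, targets={"D", "E", "F", "G"}, first_page="A")
```

With D, E, F and G all treated as target information pages, the pages are obtained in the order B, C, D, E, F, G, matching the walkthrough. Keeping the root pages on a stack makes "take the previous root page as the current root page" a single pop.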
Those skilled in the art will understand that the above manner of repeating the operations is merely an example. Other existing or future manners of repeating the operations, if applicable to the present invention, also fall within its scope of protection and are incorporated herein by reference.
Fig. 8 shows a flowchart of a method for capturing website data according to another preferred embodiment of the present invention. Here, the process also includes step S806, in which capture apparatus 1 determines and captures, according to a second predetermined rule, the target download link from all links of the target information page it has captured.
Here, the functions of capture apparatus 1 shown in Fig. 8 in step S801, step S802, step S803, step S804 and step S805 are the same as those of capture apparatus 1 described above with reference to Fig. 7 in step S701, step S702, step S703, step S704 and step S705; for brevity, they are incorporated herein by reference and not repeated.
The process by which capture apparatus 1 determines and captures the target download link is described in detail below with reference to Fig. 8:
Here, the manner of determining and capturing the target download link according to the second predetermined rule includes but is not limited to:
- determining and capturing the target download link by keyword matching against the URLs of the links in the target information page.
In one example, when the target information page is H, in step S806 capture apparatus 1 extracts all links h1 and h2 in H and then performs string matching between the URLs of links h1 and h2 and the predetermined keyword ".jar". Having determined that the URL of link h1 contains the keyword, it determines that h1 is the target download link and captures the text information and URL of target download link h1.
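Keyword matching on link URLs can be sketched in a few lines. The function name `target_download_links`, the (text, URL) pair representation and the sample URLs are hypothetical; the keyword ".jar" is the one used in the example above.

```python
def target_download_links(links, keywords=(".jar",)):
    """Keep the (text, url) pairs whose URL contains any of the keywords."""
    return [(text, url) for text, url in links
            if any(keyword in url.lower() for keyword in keywords)]


# Hypothetical links h1 and h2 of target information page H.
links_h = [
    ("Download game", "http://d.cn/files/game.jar"),
    ("Help", "http://d.cn/help.html"),
]
found = target_download_links(links_h)
```

Plain substring matching is the simplest reading of "string matching"; a stricter variant could require the keyword to be the file extension of the URL path.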
Those skilled in the art will understand that the above manner of determining the target download link is merely an example. Other existing or future manners of determining the target download link, if applicable to the present invention, also fall within its scope of protection and are incorporated herein by reference.
Fig. 9 shows a flowchart of a method for capturing website data according to another preferred embodiment of the present invention. Here, the process includes step S9061 and step S9062. In step S9061, capture apparatus 1 determines, according to the second predetermined rule, candidate download links from all links contained in the target information page it captured in step S904. In step S9062, capture apparatus 1 determines and captures the target download link from these candidate download links, according to the download data packets corresponding to the candidate download links determined in step S9061.
Here, the functions of capture apparatus 1 shown in Fig. 9 in step S901, step S902, step S903, step S904 and step S905 are the same as those of capture apparatus 1 described above with reference to Fig. 8 in step S801, step S802, step S803, step S804 and step S805; for brevity, they are incorporated herein by reference and not repeated.
The process by which capture apparatus 1 determines the candidate download links and determines and captures the target download link is described in detail below with reference to Fig. 9:
In one example, when the target information page contains links h1, h2 and h3, in step S9061 capture apparatus 1 performs string matching between the URLs of these three links and the predetermined keywords ".sis" and ".jar". Having determined that the URL of h1 contains the keyword ".sis" and the URL of h2 contains the keyword ".jar", it determines that h1 and h2 are candidate download links. Then, in step S9062, capture apparatus 1 obtains the download data packets corresponding to candidate download links h1 and h2 and reads the headers of the two download data packets to judge whether each is a binary data packet. When the download data packet corresponding to download link h1 is judged to be a binary data packet, download link h1 is determined to be the target download link, and its text information and URL are captured and stored in the data repository of capture apparatus 1.
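The header check of step S9062 is not specified in detail, so the following is only one possible heuristic, stated as an assumption: the first bytes of a download are treated as binary if they start with the ZIP signature (JAR files are ZIP archives) or contain a NUL byte. The function names, the `fetch_head` callback and the sample byte strings are hypothetical; no network access is performed.

```python
ZIP_MAGIC = b"PK\x03\x04"  # local file header signature of ZIP (and therefore JAR) files


def looks_binary(head):
    """Heuristic: ZIP signature or a NUL byte in the first bytes means binary."""
    if head.startswith(ZIP_MAGIC):
        return True
    return b"\x00" in head


def confirm_targets(candidates, fetch_head):
    """Keep the candidate URLs whose download starts with binary-looking bytes."""
    return [url for url in candidates if looks_binary(fetch_head(url))]


# Hypothetical leading bytes of the two candidate downloads.
heads = {
    "http://d.cn/h1.sis": b"\x7a\x1a\x20\x10\x00\x00\x00\x00",
    "http://d.cn/h2.jar": b"<html>not found</html>",
}
confirmed = confirm_targets(list(heads), heads.__getitem__)
```

Here the second candidate is rejected because the server answered with an HTML error page, which is the kind of false positive that URL keyword matching alone cannot detect.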
Those skilled in the art will understand that the above manner of determining candidate download links and/or of determining and capturing the target download link is merely an example. Other existing or future manners of determining candidate download links and/or of determining and capturing the target download link, if applicable to the present invention, also fall within its scope of protection and are incorporated herein by reference.
Fig. 10 shows a flowchart of a method for capturing website data according to a further embodiment of the present invention. Here, the process includes step S10011 and step S10012. In step S10011, capture apparatus 1 queries all links in the current root page, according to the website topology information and a predetermined classification list, to obtain one or more matching links that match a classification in the predetermined classification list. In step S10012, capture apparatus 1 selects one unvisited link from the one or more matching links obtained in step S10011, and obtains the next-layer page to which it points.
Here, the functions of capture apparatus 1 shown in Fig. 10 in step S1002, step S1003, step S1004 and step S1005 are the same as those of capture apparatus 1 described above with reference to Fig. 7 in step S702, step S703, step S704 and step S705; for brevity, they are incorporated herein by reference and not repeated.
The process by which capture apparatus 1 obtains the matching links and obtains the next-layer page is described in detail below with reference to Fig. 10:
Here, target information pages may belong to different classifications. For example, when capture apparatus 1 captures target information pages and target download links for mobile terminals of a particular brand and a particular model, the predetermined classification list gives the classification identifiers of the particular brand and particular model to be captured.
In one example, the website to be captured is a website that provides applications for mobile terminals. In step S10011, capture apparatus 1 first extracts all links a1, a2, a3 and a4 contained in current root page B, and takes from the website topology information of the website the matching feature information for link URLs in pages of the layer where the current root page is located:
http://c.d.cn/wml/eqp/index/From=.* $,
It matches this against the URLs of all links a1, a2, a3 and a4, and determines that the links matching the matching feature information are a1, a3 and a4. It extracts the anchor text of these links: "Nokia N8 applications" for link a1, "Nokia E7 applications" for link a3 and "LG 6660 applications" for link a4. Then, in step S10012, capture apparatus 1 performs string matching between the brand classification identifiers and corresponding model classification identifiers given in the predetermined classification list, namely "LG 6660" and "Nokia E7", and the anchor text of links a1, a3 and a4, and obtains a3 and a4 as the matching links that match the brand and model classification identifiers. Then, according to the access list, capture apparatus 1 selects the unvisited link a4 from matching links a3 and a4, sends to the web server of the website a next-layer page access request for the page to which link a4 points, and receives the next-layer page returned by the web server.
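The two-stage filtering of steps S10011 and S10012 (URL pattern first, then anchor text against the classification list) might look like the sketch below. Everything concrete in it is an assumption: the regular expression `URL_PATTERN` is only loosely modeled on the matching feature information quoted above, the link URLs are invented, and the helper names are hypothetical.

```python
import re


def matching_links(links, url_pattern, categories):
    """Keep (anchor text, url) pairs whose URL fits the pattern and whose
    anchor text contains one of the classification identifiers."""
    pattern = re.compile(url_pattern)
    return [(text, url) for text, url in links
            if pattern.match(url) and any(c in text for c in categories)]


def next_unvisited(matches, access_list):
    """Return the URL of the first matching link not yet in the access list."""
    for _text, url in matches:
        if url not in access_list:
            return url
    return None


# Hypothetical pattern and links a1..a4 of current root page B.
URL_PATTERN = r"http://c\.d\.cn/wml/eqp/index\?From=.*$"
LINKS = [
    ("Nokia N8 applications", "http://c.d.cn/wml/eqp/index?From=n8"),
    ("About us", "http://c.d.cn/about"),
    ("Nokia E7 applications", "http://c.d.cn/wml/eqp/index?From=e7"),
    ("LG 6660 applications", "http://c.d.cn/wml/eqp/index?From=lg6660"),
]
CATEGORIES = ["LG 6660", "Nokia E7"]

matches = matching_links(LINKS, URL_PATTERN, CATEGORIES)
# Assume the E7 link was visited earlier, so the LG 6660 link is chosen.
chosen = next_unvisited(matches, access_list={"http://c.d.cn/wml/eqp/index?From=e7"})
```

Restricting the traversal to matching links prunes whole branches of the site before any request is sent, which is where the classification list saves work.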
Here, anchor text means an anchor-text link, i.e. a hypertext link, which associates a text keyword with a URL.
Note that the matching feature information in this embodiment is only an illustrative example to aid understanding of the invention; it is not the matching feature information used in practical applications. Unless otherwise stated, matching feature information appearing elsewhere serves the same function and, for brevity, is not described again.
Those skilled in the art will understand that the above manner of obtaining matching links and/or of obtaining the next-layer page is merely an example. Other existing or future manners of obtaining matching links and/or obtaining the next-layer page, if applicable to the present invention, also fall within its scope of protection and are incorporated herein by reference.
Fig. 11 shows a flowchart of a method for capturing website data according to a further embodiment of the present invention. Here, the process also includes step S11041 and step S11042. When capture apparatus 1 judges in step S1102 that the next-layer page is a target information page, in step S11041 it compares the page against the captured page information; when the next-layer page is not present in the captured page information, in step S11042 capture apparatus 1 captures the next-layer page, which is the target information page.
Here, the functions of capture apparatus 1 shown in Fig. 11 in step S1101, step S1102, step S1103 and step S1105 are the same as those of capture apparatus 1 described above with reference to Fig. 7 in step S701, step S702, step S703 and step S705; for brevity, they are incorporated herein by reference and not repeated.
Here, the captured page information includes but is not limited to:
- identification information of captured target information pages, such as the URL, ID or name of a target information page;
- feature information of captured target information pages.
The captured page information may be stored in a captured database, which includes but is not limited to a relational database, a key-value storage system or a file system.
In one example, when the next-layer page is judged to be a target information page, in step S11041 capture apparatus 1 parses the markup language file of the target information page, for example by reading HTML annotation information from a predetermined position in the page's HTML file, where the annotation information is the ID of the target information page. It then compares this ID against the captured page information stored in the captured database. If the ID is not present in the captured page information, in step S11042 capture apparatus 1 captures the next-layer page, i.e. the target information page, and stores it in the data repository of capture apparatus 1, thereby achieving incremental data capture.
Preferably (with reference to Fig. 11), the process also includes step S1107 (not shown), in which capture apparatus 1 saves or updates the captured page information according to the target information page it captured in step S11042.
Specifically, in step S1107, capture apparatus 1 writes the identification information of the target information page it captured in step S11042 into the captured database, so as to save or update the captured page information. If it detects that the captured database has not yet been set up, it first initializes the captured database and then writes the identification information of the target information page into it.
In one example, in step S1107, capture apparatus 1 inserts into the captured database a data record containing the ID of the target information page it captured, so as to save or update the captured page information.
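Steps S11041, S11042 and S1107 together describe an incremental capture loop: look the page ID up, capture only if it is new, then record it. The sketch below uses an in-memory SQLite database as the captured database, which is one of several storage options the text allows; the table name `captured`, the column `page_id`, the function names and the sample IDs are assumptions.

```python
import sqlite3


def open_captured_db(path=":memory:"):
    """Open the captured database, initializing its table if it does not exist yet."""
    conn = sqlite3.connect(path)
    conn.execute("CREATE TABLE IF NOT EXISTS captured (page_id TEXT PRIMARY KEY)")
    return conn


def capture_if_new(conn, page_id):
    """Return True and record the ID if the page was not captured before, else False."""
    row = conn.execute(
        "SELECT 1 FROM captured WHERE page_id = ?", (page_id,)
    ).fetchone()
    if row is not None:
        return False  # already in the captured page information: skip
    # The actual capture of the page content would happen here.
    conn.execute("INSERT INTO captured (page_id) VALUES (?)", (page_id,))
    conn.commit()
    return True


conn = open_captured_db()
# Hypothetical page IDs; the third one repeats the first.
results = [capture_if_new(conn, pid) for pid in ("1001", "1002", "1001")]
```

Pointing `open_captured_db` at a file path instead of `":memory:"` would keep the captured page information across runs, which is what makes repeated crawls incremental.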
Those skilled in the art will understand that the above manner of saving or updating the captured page information is merely an example. Other existing or future manners of saving or updating the captured page information, if applicable to the present invention, also fall within its scope of protection and are incorporated herein by reference.
It should be noted that the present invention may be implemented in software and/or in a combination of software and hardware, for example using an application-specific integrated circuit (ASIC), a general-purpose computer or any other similar hardware device. In one embodiment, the software program of the present invention may be executed by a processor to realize the steps or functions described above. Likewise, the software program of the present invention (including related data structures) may be stored in a computer-readable recording medium, for example RAM, a magnetic or optical drive, a floppy disk or similar devices. In addition, some steps or functions of the present invention may be implemented in hardware, for example as circuitry that cooperates with a processor to perform each step or function.
It is obvious to those skilled in the art that the invention is not limited to the details of the above exemplary embodiments, and that the present invention can be realized in other specific forms without departing from its spirit or essential characteristics. Therefore, the embodiments should in every respect be regarded as exemplary and non-restrictive. The scope of the present invention is defined by the appended claims rather than by the above description, and all changes that fall within the meaning and range of equivalency of the claims are intended to be included in the present invention. No reference sign in a claim should be construed as limiting the claim concerned. Furthermore, the word "comprising" does not exclude other units or steps, and the singular does not exclude the plural. Multiple units or devices stated in a device claim may also be realized by a single unit or device through software or hardware. Words such as "first" and "second" are used to denote names and do not indicate any particular order.
Claims (10)
1. A computer-implemented method for capturing website data, the method comprising the following steps:
a. according to website topology information, selecting an unvisited link from all links in the current root page, and obtaining the next-layer page to which it points;
b. determining, according to a first predetermined rule, whether the next-layer page is a target information page;
c1. when the next-layer page is not a target information page, taking the next-layer page as the current root page and repeating steps a and b until a first predetermined condition is met;
c2. when the next-layer page is determined to be a target information page, capturing the target information page;
wherein the method further comprises:
- when a second predetermined condition is met, taking the previous root page as the current root page and repeating steps a, b, c1 and c2;
wherein step a comprises:
- querying all links in the current root page according to the website topology information and a predetermined category list, to obtain one or more matching links that match a category in the predetermined category list;
- selecting an unvisited link from the one or more matching links, and obtaining the next-layer page to which it points.
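The steps of claim 1 can be illustrated with a minimal Python sketch. It is not the patented implementation: the in-memory `SITE` mapping, the `is_target` predicate (standing in for the first predetermined rule), and the choice to backtrack when a page has no unvisited matching links (standing in for the second predetermined condition) are all assumptions made for illustration.

```python
from typing import Callable, Dict, List, Set, Tuple

# Hypothetical stand-in for a website: page URL -> list of (link URL, category).
SITE: Dict[str, List[Tuple[str, str]]] = {
    "/": [("/news", "news"), ("/about", "misc"), ("/sports", "sports")],
    "/news": [("/news/1.html", "news"), ("/news/2.html", "news")],
    "/sports": [("/sports/a.html", "sports"), ("/news/1.html", "news")],
}


def capture_depth_first(
    site: Dict[str, List[Tuple[str, str]]],
    root: str,
    categories: Set[str],
    is_target: Callable[[str], bool],
) -> List[str]:
    """Return target pages in the order a depth-first traversal captures them."""
    visited: Set[str] = {root}
    captured: List[str] = []
    stack: List[str] = [root]  # top of the stack is the current root page

    while stack:
        current = stack[-1]
        # Step a: keep links whose category is in the list and that are unvisited.
        candidates = [
            url for url, category in site.get(current, [])
            if category in categories and url not in visited
        ]
        if not candidates:
            # Assumed second condition: nothing left here, so the previous
            # root page becomes the current root page again.
            stack.pop()
            continue
        next_page = candidates[0]
        visited.add(next_page)
        # Step b: apply the (assumed) first predetermined rule.
        if is_target(next_page):
            captured.append(next_page)  # step c2: capture the target page
        else:
            stack.append(next_page)     # step c1: descend one layer
    return captured


if __name__ == "__main__":
    pages = capture_depth_first(
        SITE, "/", {"news", "sports"}, lambda url: url.endswith(".html")
    )
    print(pages)
```

In this sketch the explicit stack plays the role of the chain of root pages: pushing corresponds to step c1, and popping corresponds to returning to the previous root page.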
2. The method according to claim 1, wherein the method further comprises:
y. according to a second predetermined rule, determining and capturing a target download link from all links of the target information page.
3. The method according to claim 2, wherein step y comprises:
- according to the second predetermined rule, determining candidate download links from all links contained in the target information page;
- according to the download data packet corresponding to each candidate download link, determining and capturing the target download link from the candidate download links.
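The two-stage selection in claims 2 and 3 might look like the following sketch. The claims do not state what the second predetermined rule is or how the download data packet is inspected, so the URL pattern `CANDIDATE_PATTERN`, the ZIP signature check, and the injected `fetch` callable are hypothetical choices made only to keep the example runnable.

```python
import re
from typing import Callable, Iterable, Optional

# Assumed second predetermined rule: the URL ends in an archive-like extension.
CANDIDATE_PATTERN = re.compile(r"\.(zip|apk)(\?.*)?$", re.IGNORECASE)
# Assumed packet check: ZIP-based files start with this local file header signature.
ZIP_MAGIC = b"PK\x03\x04"


def find_target_download_link(
    links: Iterable[str],
    fetch: Callable[[str], bytes],
) -> Optional[str]:
    """Pick candidate links by URL, then confirm one by its downloaded bytes."""
    # Stage 1: candidate download links determined from the page's links.
    candidates = [url for url in links if CANDIDATE_PATTERN.search(url)]
    # Stage 2: inspect the data returned for each candidate.
    for url in candidates:
        payload = fetch(url)
        if payload.startswith(ZIP_MAGIC):
            return url
    return None


if __name__ == "__main__":
    # Hypothetical responses; a real crawler would issue HTTP requests instead.
    fake_responses = {
        "/files/ad.zip": b"<html>advertisement</html>",
        "/files/app.zip": ZIP_MAGIC + b"rest-of-archive",
    }
    links = ["/help.html", "/files/ad.zip", "/files/app.zip"]
    print(find_target_download_link(links, lambda url: fake_responses[url]))
```

Passing `fetch` as a parameter keeps the sketch independent of any particular HTTP library and makes the second stage easy to exercise with canned data.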
4. The method according to any one of claims 1 to 3, wherein step c2 comprises:
- when the next-layer page is determined to be a target information page, comparing it with captured page information;
- when the next-layer page is not present in the captured page information, capturing it as the target information page.
5. The method according to claim 4, wherein the method further comprises:
- saving or updating the captured page information according to the target information page that has been captured.
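Claims 4 and 5 describe checking a page against captured page information before capturing it, and then saving or updating that information. A minimal sketch follows, assuming the captured page information is a mapping from URL to a SHA-256 digest of the page content; the `CapturedPages` class and the decision to treat changed content as "not present" are illustrative assumptions, not details given in the claims.

```python
import hashlib
from typing import Dict


class CapturedPages:
    """Hypothetical store of captured page information: URL -> content digest."""

    def __init__(self) -> None:
        self._digests: Dict[str, str] = {}

    @staticmethod
    def _digest(content: bytes) -> str:
        return hashlib.sha256(content).hexdigest()

    def needs_capture(self, url: str, content: bytes) -> bool:
        # Claim 4: capture only when the page is not already recorded.
        # Assumption: a page whose content changed counts as not recorded.
        return self._digests.get(url) != self._digest(content)

    def record(self, url: str, content: bytes) -> None:
        # Claim 5: save a new entry or update an existing one.
        self._digests[url] = self._digest(content)


if __name__ == "__main__":
    store = CapturedPages()
    page = ("/news/1.html", b"first version")
    if store.needs_capture(*page):
        store.record(*page)
    print(store.needs_capture(*page))  # False: unchanged page is skipped
```

Keeping a digest rather than only the URL lets the same store serve both the "save" and the "update" cases, at the cost of fetching the page content before deciding.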
6. An apparatus for capturing website data, the apparatus comprising:
a first acquisition device, configured to select, according to website topology information, an unvisited link from all links in the current root page, and to obtain the next-layer page to which it points;
a judgment device, configured to determine, according to a first predetermined rule, whether the next-layer page is a target information page;
a first loop device, configured to, when the next-layer page is determined not to be a target information page, take the next-layer page as the current root page and repeat the operations of the first acquisition device and the judgment device until a first predetermined condition is met;
wherein the apparatus further comprises:
a first capture device, configured to capture the target information page when the next-layer page is determined to be a target information page;
wherein the apparatus further comprises:
a second loop device, configured to, when a second predetermined condition is met, take the previous root page as the current root page and repeat the operations of the first acquisition device, the judgment device, the first loop device and the first capture device;
wherein the first acquisition device comprises:
a second acquisition unit, configured to query all links in the current root page according to the website topology information and a predetermined category list, to obtain one or more matching links that match a category in the predetermined category list;
a third acquisition unit, configured to select an unvisited link from the one or more matching links, and to obtain the next-layer page to which it points.
7. The apparatus according to claim 6, wherein the apparatus further comprises:
a second capture device, configured to determine and capture, according to a second predetermined rule, a target download link from all links of the target information page.
8. The apparatus according to claim 7, wherein the second capture device comprises:
a link determining unit, configured to determine, according to the second predetermined rule, candidate download links from all links contained in the target information page;
a third capture unit, configured to determine and capture the target download link from the candidate download links according to the download data packet corresponding to each candidate download link.
9. The apparatus according to any one of claims 6 to 8, wherein the first capture device comprises:
a comparing unit, configured to compare the next-layer page with captured page information when the next-layer page is determined to be a target information page;
a third capture unit, configured to capture the next-layer page as the target information page when it is not present in the captured page information.
10. The apparatus according to claim 9, wherein the apparatus further comprises:
an updating device, configured to save or update the captured page information according to the target information page that has been captured.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201210030588.6A CN103246675B (en) | 2012-02-10 | 2012-02-10 | A kind of method and apparatus for being used to capture website data |
Publications (2)
Publication Number | Publication Date |
---|---|
CN103246675A CN103246675A (en) | 2013-08-14 |
CN103246675B true CN103246675B (en) | 2018-01-12 |
Family
ID=48926199
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201210030588.6A Active CN103246675B (en) | 2012-02-10 | 2012-02-10 | A kind of method and apparatus for being used to capture website data |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN103246675B (en) |
Families Citing this family (6)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN106326225B (en) * | 2015-06-16 | 2019-09-17 | 阿里巴巴集团控股有限公司 | Page data acquisition method and device |
CN106547803B (en) * | 2015-09-23 | 2019-12-13 | 北京国双科技有限公司 | Method and device for crawling incremental resources of website |
CN105740363A (en) * | 2016-01-26 | 2016-07-06 | 上海晶赞科技发展有限公司 | Website target page discovery method and apparatus |
CN107544994B (en) * | 2016-06-27 | 2021-01-22 | 北京国双科技有限公司 | Associated data processing method and device |
CN110309389A (en) * | 2018-03-14 | 2019-10-08 | 北京嘀嘀无限科技发展有限公司 | Cloud computing system |
CN110633400A (en) * | 2018-06-06 | 2019-12-31 | 腾讯科技(北京)有限公司 | Webpage data capturing method and device, storage medium and electronic device |
Citations (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN101520798A (en) * | 2009-03-06 | 2009-09-02 | 苏州锐创通信有限责任公司 | Webpage classification technology based on vertical search and focused crawler |
Family Cites Families (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US20040220954A1 (en) * | 2003-04-29 | 2004-11-04 | International Business Machines Corporation | Translation of data from a hierarchical data structure to a relational data structure |
Non-Patent Citations (1)
Title |
---|
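Weng Yanqing (翁岩青); "Research on Web Page Crawling Strategies" (网页抓取策略研究); China Master's Theses Full-text Database (Electronic Journal), Information Science and Technology series; 2011-05-15 (No. 05); Chapter 4 * |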
Also Published As
Publication number | Publication date |
---|---|
CN103246675A (en) | 2013-08-14 |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
C06 | Publication | ||
PB01 | Publication | ||
C10 | Entry into substantive examination | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||