CN110020064A

CN110020064A - Web page crawling method and device

Info

Publication number: CN110020064A
Application number: CN201710591483.0A
Authority: CN
Inventors: 邢琰
Original assignee: Beijing Gridsum Technology Co Ltd
Current assignee: Beijing Gridsum Technology Co Ltd
Priority date: 2017-07-19
Filing date: 2017-07-19
Publication date: 2019-07-16

Abstract

The invention discloses a web page crawling method and device. The method comprises: crawling sub-page URLs under the root URL of a target website; judging a sub-page URL using a first set and a second set to obtain a first judgment result, wherein the first set is used to judge whether the web page corresponding to the sub-page URL is a catalogue page, and the second set is used to judge whether that web page is a content page; judging the sub-page URL using a predetermined class to obtain a second judgment result, wherein the second judgment result indicates whether the sub-page URL is a catalogue page or a content page; if the first judgment result is consistent with the second judgment result, continuing to crawl the sub-page URL according to the first judgment result; and if the first judgment result is inconsistent with the second judgment result, recording the sub-page URL. The invention solves the technical problem of low efficiency in determining website URL rules and improves processing efficiency.

Description

Web page crawling method and device

Technical field

The present invention relates to the field of the Internet, and in particular to a web page crawling method and device.

Background art

A web crawler is a program that automatically fetches specific information from the World Wide Web according to certain rules. Practical crawler systems usually limit the crawl depth and need to judge whether a page is a content page or a catalogue page. (A content page is a page the crawler is meant to fetch, such as an article or report on a website; a catalogue page is an index of content pages, and each link on a catalogue page points to a content page.) The logic for judging content pages and catalogue pages in a crawler system is fairly simple, but website developers design URL structures with great variability (for example, a relevant URL structure: http://www.ccszf.gov.cn/Ccszf/1/tindex.shtml). The same URL structure may be a content page on one website and a catalogue page on another, and the judgment logic in a crawler system cannot cover all websites. As a result, the related art frequently misjudges content pages and catalogue pages. A misjudgment causes the system to process the page incorrectly. If a content page is misjudged as a catalogue page, it is discarded as a catalogue page once the maximum crawl depth is reached and is no longer crawled, so content pages are lost. If a catalogue page is misjudged as a content page, no links are extracted from it, because links are not extracted from content pages, so the links on the catalogue page are lost. In addition, in the related art the URL judgment rules for a website to be crawled are added manually: URLs that do not fit the website's normal pattern are screened out by hand and added to a predetermined class, and while the crawler program runs, the judgment result given by the predetermined class determines whether a page is a content page or a catalogue page. This purely manual approach requires the user to open, one by one, the web pages for every URL structure on the website and judge each of them by hand. After finding the pattern, the user must also manually add judgment statements to the predetermined class. Because URL rules differ from website to website, the rules to be added differ as well, and the workload grows sharply as the number of seeds increases.

No effective solution has yet been proposed for the above problem of low efficiency in determining website URL rules.

Summary of the invention

Embodiments of the present invention provide a web page crawling method and device, so as to at least solve the technical problem of low efficiency in determining website URL rules.

According to one aspect of the embodiments of the present invention, a web page crawling method is provided, comprising: crawling sub-page URLs under the root URL of a target website; judging the sub-page URL using a first set and a second set to obtain a first judgment result, wherein the first set is used to judge whether the web page corresponding to the sub-page URL is a catalogue page, and the second set is used to judge whether the web page corresponding to the sub-page URL is a content page; judging the sub-page URL using a predetermined class to obtain a second judgment result, wherein the second judgment result indicates whether the sub-page URL is the catalogue page or the content page; if the first judgment result is consistent with the second judgment result, continuing to crawl the sub-page URL according to the first judgment result; and if the first judgment result is inconsistent with the second judgment result, recording the sub-page URL.

Optionally, judging the sub-page URL using the first set and the second set to obtain the first judgment result comprises: judging whether the sub-page URL is intercepted by the first set or by the second set; if the sub-page URL is intercepted by the first set, determining that the web page corresponding to the sub-page URL is the catalogue page; and if the sub-page URL is intercepted by the second set, determining that the web page corresponding to the sub-page URL is the content page.

Optionally, after recording the sub-page URL, the method further comprises: crawling the sub-page URL according to the first judgment result.

Optionally, after recording the sub-page URL, the method further comprises: adding a logic judgment rule to the predetermined class, wherein the logic judgment rule is used to judge whether the sub-page URL is a content page or a catalogue page; and using the logic judgment rule in the predetermined class to judge the sub-page URL as the page type indicated by the first judgment result, the page type being content page or catalogue page.

Optionally, adding the logic judgment rule to the predetermined class comprises: adding the logic judgment rule for the sub-page URL to the predetermined class by means of a character string template; and if the logic judgment rule cannot be added for the sub-page URL by means of the character string template, saving the information of the sub-page URL.

Optionally, before judging the sub-page URL using the first set and the second set to obtain the first judgment result, the method comprises: converting relative sub-page URLs into absolute sub-page URLs, wherein a relative sub-page is a web page having the full domain name of the sub-page, and an absolute sub-page refers to the sub-page after the root is removed; removing sub-page URLs whose structure is similar to that of an absolute sub-page URL; and removing sub-page URLs that do not meet the requirements.

Optionally, removing sub-page URLs that do not meet the requirements comprises: removing sub-page URLs with a different domain name and/or filtering out sub-page URLs that cannot be crawled.

According to another aspect of the embodiments of the present invention, a web page crawling device is also provided, comprising: a first crawling unit, configured to crawl sub-page URLs under the root URL of a target website; a first judging unit, configured to judge the sub-page URL using a first set and a second set to obtain a first judgment result, wherein the first set is used to judge whether the web page corresponding to the sub-page URL is a catalogue page, and the second set is used to judge whether the web page corresponding to the sub-page URL is a content page; a second judging unit, configured to judge the sub-page URL using a predetermined class to obtain a second judgment result, wherein the second judgment result indicates whether the sub-page URL is the catalogue page or the content page; a second crawling unit, configured to continue crawling the sub-page URL according to the first judgment result if the first judgment result is consistent with the second judgment result; and a recording unit, configured to record the sub-page URL if the first judgment result is inconsistent with the second judgment result.

Optionally, the first judging unit comprises: a first judging module, configured to judge whether the sub-page URL is intercepted by the first set or by the second set; a first determining module, configured to determine that the web page corresponding to the sub-page URL is the catalogue page if the sub-page URL is intercepted by the first set; and a second determining module, configured to determine that the web page corresponding to the sub-page URL is the content page if the sub-page URL is intercepted by the second set.

Optionally, the device further comprises: an execution unit, configured to crawl the sub-page URL according to the first judgment result after the sub-page URL is recorded.

In the embodiments of the present invention, sub-page URLs under the root URL of a target website are crawled; the sub-page URL is judged using a first set and a second set to obtain a first judgment result, wherein the first set is used to judge whether the web page corresponding to the sub-page URL is a catalogue page, and the second set is used to judge whether that web page is a content page; the sub-page URL is judged using a predetermined class to obtain a second judgment result, wherein the second judgment result indicates whether the sub-page URL is a catalogue page or a content page; if the first judgment result is consistent with the second judgment result, crawling of the sub-page URL continues according to the first judgment result; and if the first judgment result is inconsistent with the second judgment result, the sub-page URL is recorded. This solves the technical problem of low efficiency in determining website URL rules and improves processing efficiency.

Brief description of the drawings

The drawings described herein are provided for a further understanding of the present invention and form a part of this application. The illustrative embodiments of the present invention and their descriptions are used to explain the present invention and do not constitute an improper limitation of it. In the drawings:

Fig. 1 is a flow chart of a web page crawling method according to an embodiment of the present invention;

Fig. 2 is a flow chart of web page crawling pre-processing according to another embodiment of the present invention;

Fig. 3 is a schematic diagram of a web page crawling device according to an embodiment of the present invention.

Detailed description of the embodiments

To enable those skilled in the art to better understand the solution of the present invention, the technical solutions in the embodiments of the present invention are described clearly and completely below with reference to the drawings in the embodiments. The described embodiments are only some of the embodiments of the present invention, not all of them. All other embodiments obtained by a person of ordinary skill in the art on the basis of these embodiments without creative effort shall fall within the scope of protection of the present invention.

It should be noted that the terms "first", "second", and the like in the description, the claims, and the above drawings are used to distinguish similar objects and are not used to describe a particular order or sequence. It should be understood that data used in this way are interchangeable where appropriate, so that the embodiments of the present invention described herein can be implemented in orders other than those illustrated or described herein. In addition, the terms "comprise" and "have" and any variants of them are intended to cover non-exclusive inclusion. For example, a process, method, system, product, or device that comprises a series of steps or units is not necessarily limited to the steps or units expressly listed, but may include other steps or units that are not expressly listed or that are inherent to the process, method, product, or device.

According to an embodiment of the present invention, an embodiment of a web page crawling method is provided. It should be noted that the steps shown in the flow chart of the drawings can be executed in a computer system, for example as a set of computer-executable instructions, and that although a logical order is shown in the flow chart, in some cases the steps shown or described may be executed in a different order.

Fig. 1 is a flow chart of a web page crawling method according to an embodiment of the present invention. As shown in Fig. 1, the method comprises the following steps:

Step S102: crawl sub-page URLs under the root URL of the target website;

Step S104: judge the sub-page URL using a first set and a second set to obtain a first judgment result, wherein the first set is used to judge whether the web page corresponding to the sub-page URL is a catalogue page, and the second set is used to judge whether the web page corresponding to the sub-page URL is a content page;

Step S106: judge the sub-page URL using a predetermined class to obtain a second judgment result, wherein the second judgment result indicates whether the sub-page URL is a catalogue page or a content page;

Step S108: if the first judgment result is consistent with the second judgment result, continue to crawl the sub-page URL according to the first judgment result;

Step S110: if the first judgment result is inconsistent with the second judgment result, record the sub-page URL.

The above steps use the series of judgment rules in the first set and the second set to judge whether a web page is a catalogue page or a content page, then judge it again with the predetermined class, and compare the result from the predetermined class with the result from the first set or second set. If the results are consistent, crawling proceeds according to that result. If they are inconsistent, the result from the first set or second set prevails, and the inconsistency shows that the result from the predetermined class is wrong, so the URL is recorded in a document dedicated to recording errors. These steps automatically find misjudged web pages and directly indicate the correct page type, which improves the efficiency of determining website rules.
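The comparison in steps S104 to S110 can be sketched as follows. This is a minimal illustration rather than the patented implementation: the function name `compare_judgments` and the two injected judge callables are hypothetical, and Python is used only for illustration (the source does not state the implementation language).

```python
from typing import Callable, Iterable, List, Optional, Tuple

# Assumed page-type labels: "catalogue", "content", or None when undecided.
PageType = Optional[str]


def compare_judgments(
    urls: Iterable[str],
    judge_with_sets: Callable[[str], PageType],   # stands in for step S104
    judge_with_class: Callable[[str], PageType],  # stands in for step S106
) -> Tuple[List[Tuple[str, PageType]], List[str]]:
    """Split URLs into those where both judgments agree and those to record."""
    agreed: List[Tuple[str, PageType]] = []
    recorded: List[str] = []
    for url in urls:
        first = judge_with_sets(url)
        second = judge_with_class(url)
        if first == second:
            # Step S108: consistent, so keep crawling with the first result.
            agreed.append((url, first))
        else:
            # Step S110: inconsistent, so record the URL for later rule updates.
            recorded.append(url)
    return agreed, recorded
```

The recorded list plays the role of the error document described above; a caller that also wants to keep crawling recorded URLs can do so using the result of `judge_with_sets`.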

During the crawling in the above steps, in an optional embodiment, the first set or second set can be used repeatedly for judgment, and the first judgment result obtained with the first set or second set is compared with the second judgment result obtained with the predetermined class, until crawling reaches the maximum depth.

In the above steps, a specific embodiment of judging the sub-page using the first set or second set may comprise:

judging whether the sub-page URL is intercepted by the first set or by the second set;

if the sub-page URL is intercepted by the first set, determining that the web page corresponding to the sub-page URL is a catalogue page;

if the sub-page URL is intercepted by the second set, determining that the web page corresponding to the sub-page URL is a content page.

The above steps determine whether a sub-page is a content page or a catalogue page without requiring a further manual judgment in the predetermined class (RulesMatching), which improves efficiency.
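One way to picture the two sets is as two lists of predicate functions, where a URL is "intercepted" by a set when any predicate in that set matches it. The sketch below works under that assumption; the regular expressions are invented examples, not rules taken from the source.

```python
import re
from typing import Callable, List, Optional

Predicate = Callable[[str], bool]

# Hypothetical first set: features that suggest a catalogue page.
CATALOGUE_SET: List[Predicate] = [
    lambda url: re.search(r"/(index\.s?html?)?$", url) is not None,
]

# Hypothetical second set: features that suggest a content page.
CONTENT_SET: List[Predicate] = [
    lambda url: re.search(r"/\d+\.s?html?$", url) is not None,
]


def first_judgment(url: str) -> Optional[str]:
    """Return the page type of whichever set intercepts the URL, else None."""
    if any(rule(url) for rule in CATALOGUE_SET):
        return "catalogue"
    if any(rule(url) for rule in CONTENT_SET):
        return "content"
    return None
```

Checking the catalogue set first is a design choice of this sketch; the source does not say which set is consulted first or what happens when neither set intercepts a URL.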

After the sub-page URL is recorded, in an optional embodiment, crawling of the sub-page URL can also continue according to the first judgment result.

This step temporarily assigns an accurate crawling rule, namely the result obtained with the first set or second set, which improves processing efficiency and avoids waiting.

The judgment rule with which the first set or second set judged the sub-page can be added to the predetermined class, so that the predetermined class can judge the sub-page next time. In an optional embodiment, after recording the sub-page URL, the method further comprises: adding a logic judgment rule to the predetermined class, wherein the logic judgment rule is used to judge whether the sub-page URL is a content page or a catalogue page; and using the logic judgment rule in the predetermined class to judge the sub-page URL as the page type indicated by the first judgment result.

The above steps automatically update the judgment rules in the predetermined class, which both ensures the accuracy of the judgment and improves efficiency. If the next judgment is consistent with the first judgment, the judgment rules in the predetermined class do not need to change; if it is inconsistent, they need to be changed.

After the problematic web pages have been screened out and recorded in the above steps, the document dedicated to recording errors is processed as follows. In an optional embodiment, if the document is not empty, that is, there are web pages with judgment problems, the sub-pages in the document can be de-duplicated, and logic judgment rules for the sub-page URLs are then added to the predetermined class by means of a character string template. If a logic judgment rule cannot be added for a sub-page URL by means of the character string template, the information of that sub-page URL is saved so that the rule can be added manually.
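The source does not describe what the character string template looks like. As a rough sketch only, the following uses Python's `string.Template` to render a regular-expression rule from a recorded URL, and returns `None` when no rule can be generated so that the caller can save the URL for manual handling. The template text, the `make_rule` helper, and the criterion for "cannot be added automatically" (a query string, or a path with no digits to generalize) are all assumptions.

```python
import re
from string import Template
from typing import Optional, Tuple
from urllib.parse import urlsplit

# Hypothetical rule template; "$$" renders a literal "$" (end-of-string anchor).
RULE_TEMPLATE = Template(r"^https?://$host$path$$")


def make_rule(url: str, page_type: str) -> Optional[Tuple[str, str]]:
    """Render a (regex, page_type) rule for a recorded URL, or None if unable."""
    parts = urlsplit(url)
    # Assumed criterion: URLs with a query string or without digits in the
    # path are left for manual rule writing.
    if parts.query or not re.search(r"\d", parts.path):
        return None
    host = re.escape(parts.netloc)
    # Generalize every run of digits in the path to "\d+".
    path = re.sub(r"\d+", r"\\d+", re.escape(parts.path))
    return RULE_TEMPLATE.substitute(host=host, path=path), page_type
```

A caller would append each returned rule to whatever rule list stands in for the predetermined class and write the URLs that yield `None` to a file for manual review.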

Through the above steps, the problematic web pages are screened separately, so that the crawling rules can be determined more accurately, new rules can be added easily, and the new rules can be confirmed.

After the page has been determined to be a content page or a catalogue page, in an optional embodiment, continuing to crawl the sub-page URL according to the first judgment result comprises: when the first judgment result indicates that the web page corresponding to the sub-page URL is a catalogue page, continuing to crawl the sub-page URL according to a first rule corresponding to catalogue pages; and when the first judgment result indicates that the web page corresponding to the sub-page URL is a content page, continuing to crawl the sub-page URL according to a second rule corresponding to content pages.

Through the above embodiment, sub-page URLs are crawled reasonably and effectively.

Fig. 2 is a flow chart of web page crawling pre-processing according to another embodiment of the present invention. As shown in Fig. 2, before the sub-page URL is judged using the first set and the second set to obtain the first judgment result, the sub-page URL may be pre-processed. In an optional embodiment, the pre-processing comprises:

Step S201: determine the link set of the web page;

Step S203: convert relative sub-page URLs into absolute sub-page URLs;

Step S205: remove sub-page URLs whose structure is the same as that of an absolute sub-page URL;

Step S207: remove sub-page URLs that do not meet the requirements.

Through the above process, similar sub-pages are removed. A relative sub-page is one having the full domain name of the sub-page, and an absolute sub-page refers to the sub-page after the root is removed. Structural similarity refers to removing sub-pages whose structure is similar to that of an absolute sub-page, for example web page 1: http://www.abc/culture/0023.html and web page 2: http://www.abc/education/2345.html. Web page 1 and web page 2 are structurally similar. The above embodiment filters out sub-pages of the same kind more precisely, which improves accuracy.
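The source gives the example of two structurally similar URLs but does not define the similarity test. A minimal sketch, assuming that two URLs are similar when they share a host and their paths have the same "shape" after digit runs and letter runs are collapsed, could look like this; `structure_key` and `dedupe_by_structure` are hypothetical names.

```python
import re
from typing import Iterable, List, Set, Tuple
from urllib.parse import urlsplit


def structure_key(url: str) -> Tuple[str, str]:
    """Reduce a URL to (host, path shape) so similar URLs share one key."""
    parts = urlsplit(url)
    shape = re.sub(r"\d+", "9", parts.path)    # any digit run becomes "9"
    shape = re.sub(r"[A-Za-z]+", "a", shape)   # any letter run becomes "a"
    return parts.netloc, shape


def dedupe_by_structure(urls: Iterable[str]) -> List[str]:
    """Keep only the first URL seen for each structure key."""
    seen: Set[Tuple[str, str]] = set()
    kept: List[str] = []
    for url in urls:
        key = structure_key(url)
        if key not in seen:
            seen.add(key)
            kept.append(url)
    return kept
```

Under this assumption the two example URLs above both reduce to the shape `/a/9.a`, so only the first one is kept.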

In the above process, removing sub-page URLs that do not meet the requirements comprises:

removing sub-page URLs with a different domain name and/or filtering out sub-page URLs that cannot be crawled.
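The link normalization and filtering described in this pre-processing can be sketched with the standard `urllib.parse` module. The sketch assumes the conventional meaning of a relative link (one lacking a scheme and domain, resolved against the root URL) and treats links ending in `.doc` or `.rar`, or using a scheme other than HTTP or HTTPS, as not crawlable; the function name `preprocess_links` and the suffix list are illustrative assumptions.

```python
from typing import Iterable, List
from urllib.parse import urljoin, urlsplit

# Assumed examples of file types the crawler cannot process.
BLOCKED_SUFFIXES = (".doc", ".rar")


def preprocess_links(root_url: str, links: Iterable[str]) -> List[str]:
    """Resolve links against the root URL and drop those that do not qualify."""
    root_host = urlsplit(root_url).netloc
    kept: List[str] = []
    for link in links:
        absolute = urljoin(root_url, link)  # relative link -> absolute URL
        parts = urlsplit(absolute)
        if parts.scheme not in ("http", "https"):
            continue  # e.g. mailto: or javascript: links
        if parts.netloc != root_host:
            continue  # different domain name
        if parts.path.lower().endswith(BLOCKED_SUFFIXES):
            continue  # file type assumed not crawlable
        kept.append(absolute)
    return kept
```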

Optionally, the seed information is read from a database table, including parameters such as the root URL and the maximum crawl depth MaximumDepth (the depth the crawler actually needs to crawl). The maximum crawl depth controls how many times links are extracted from the root URL and its sub-URLs. The program first crawls the page under the URL and performs preliminary processing on the links extracted each time, so that every link begins with Http:// or Https://. It then performs preliminary screening and de-duplication on the extracted links, for example removing URLs with a different domain name, URLs with fewer segments than the root URL, and URLs containing .doc, .rar, and so on.

The URLs that pass preliminary screening are filtered again and judged in two places. Each URL first enters the program segment that judges content pages and catalogue pages. This segment consists of two sets (one set for judging content pages and one set for judging catalogue pages; each delegate in a set is a method that judges the page type from a content page or catalogue page feature). Whichever set intercepts the URL indicates whether it is a catalogue page or a content page. The URL is then judged by the RulesMatching class. If the two results are the same, the judgment is correct. If the two results differ, the result obtained from the sets prevails and the URL is recorded (a rule for this URL needs to be added to the RulesMatching class later). This logic is executed repeatedly until the maximum crawl depth is reached.

Finally, the recorded URLs are de-duplicated. If the resulting set is empty, the content page and catalogue page judgment for the website has no problems. If the set contains URLs, a logic judgment rule needs to be added for each URL. If the rule to be added is a common one, it can be added automatically through a character string template; if it cannot be added automatically, all of the information is stored in a file so that the rule can be added manually. In the above process, after a URL has been judged to be a catalogue page or a content page, the catalogue page is extracted if it is a catalogue page. Each time a catalogue page is extracted, the extraction depth is reduced by one and the program checks whether the depth is one. If the depth is not one, the program returns to the link set and continues crawling; if the depth is one after a catalogue page is extracted, the program no longer returns to the link set.
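The depth control described above can be illustrated with a small level-by-level crawl loop. This is a simplified sketch under stated assumptions, not the program in the source: link extraction and page classification are injected as callables so that the example needs no network access, the depth counter is decremented once per level and the loop stops when it reaches zero, and the names `crawl`, `extract_links`, and `classify` are hypothetical.

```python
from typing import Callable, Iterable, List, Set


def crawl(
    root_url: str,
    maximum_depth: int,
    extract_links: Callable[[str], Iterable[str]],
    classify: Callable[[str], str],
) -> List[str]:
    """Collect content pages reachable within maximum_depth link extractions."""
    visited: Set[str] = set()
    content_pages: List[str] = []
    frontier: List[str] = [root_url]
    depth = maximum_depth
    while frontier and depth >= 1:
        next_frontier: List[str] = []
        for url in frontier:
            for link in extract_links(url):
                if link in visited:
                    continue  # de-duplicate links across the whole crawl
                visited.add(link)
                if classify(link) == "catalogue":
                    next_frontier.append(link)  # follow links on the next level
                else:
                    content_pages.append(link)  # content pages are not expanded
        frontier = next_frontier
        depth -= 1  # one extraction level has been consumed
    return content_pages
```

With a maximum depth of 2, the loop extracts links from the root page and then from the catalogue pages found there; catalogue pages discovered at the last level are not expanded further.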

An illustration is given below with reference to an optional embodiment.

In the overall flow, the content page and catalogue page judgment is performed on the pre-processed links: the database is read to obtain the seed information; the seed page is crawled and links are extracted; the links are filtered, de-duplicated, and otherwise processed. The crawl depth is controlled by the maximum crawl depth. Problematic URLs are obtained; logic judgments are added for the problematic URLs, and URLs for which rules are hard to add automatically are stored in a file so that rules can be added manually.

An embodiment of the present invention also provides a web page crawling device. It should be noted that the web page crawling device of the embodiment of the present invention can be used to execute the web page crawling method provided by the embodiment of the present invention, and the web page crawling method of the embodiment of the present invention can also be executed by the web page crawling device provided by the embodiment of the present invention.

Fig. 3 is a schematic diagram of a web page crawling device according to an embodiment of the present invention. As shown in Fig. 3, the crawling device comprises:

a first crawling unit 32, configured to crawl sub-page URLs under the root URL of the target website;

a first judging unit 34, configured to judge the sub-page URL using a first set and a second set to obtain a first judgment result, wherein the first set is used to judge whether the web page corresponding to the sub-page URL is a catalogue page, and the second set is used to judge whether the web page corresponding to the sub-page URL is a content page;

a second judging unit 36, configured to judge the sub-page URL using a predetermined class to obtain a second judgment result, wherein the second judgment result indicates whether the sub-page URL is a catalogue page or a content page;

a second crawling unit 38, configured to continue crawling the sub-page URL according to the first judgment result if the first judgment result is consistent with the second judgment result;

a recording unit 310, configured to record the sub-page URL if the first judgment result is inconsistent with the second judgment result.

In an optional embodiment, the first judging unit comprises: a first judging module, configured to judge whether the sub-page URL is intercepted by the first set or by the second set; a first determining module, configured to determine that the web page corresponding to the sub-page URL is a catalogue page if the sub-page URL is intercepted by the first set; and a second determining module, configured to determine that the web page corresponding to the sub-page URL is a content page if the sub-page URL is intercepted by the second set.

In an optional embodiment, the device further comprises: an execution unit, configured to crawl the sub-page URL according to the first judgment result after the sub-page URL is recorded.

The above device automatically finds misjudged web pages and directly indicates the correct page type, which improves the efficiency of determining website rules.

The serial numbers of the above embodiments of the present invention are for description only and do not represent the relative merits of the embodiments.

In the above embodiments of the present invention, the description of each embodiment has its own emphasis. For parts not described in detail in one embodiment, refer to the related descriptions of the other embodiments.

In the several embodiments provided in this application, it should be understood that the disclosed technical content can be implemented in other ways. The device embodiments described above are merely illustrative. For example, the division into units may be a division by logical function, and other divisions are possible in actual implementation: multiple units or components may be combined or integrated into another system, or some features may be omitted or not executed. In addition, the mutual coupling, direct coupling, or communication connection shown or discussed may be an indirect coupling or communication connection through some interfaces, units, or modules, and may be electrical or take other forms.

The units described as separate components may or may not be physically separate, and the components shown as units may or may not be physical units; that is, they may be located in one place or distributed over multiple units. Some or all of the units may be selected according to actual needs to achieve the purpose of the solution of this embodiment.

In addition, the functional units in the embodiments of the present invention may be integrated into one processing unit, each unit may exist physically on its own, or two or more units may be integrated into one unit. The integrated unit may be implemented in the form of hardware or in the form of a software functional unit.

If the integrated unit is implemented in the form of a software functional unit and sold or used as an independent product, it may be stored in a computer-readable storage medium. Based on this understanding, the technical solution of the present invention in essence, or the part that contributes to the prior art, or all or part of the technical solution, may be embodied in the form of a software product. The computer software product is stored in a storage medium and includes instructions that cause a computer device (which may be a personal computer, a server, a network device, or the like) to execute all or some of the steps of the methods of the embodiments of the present invention. The aforementioned storage medium includes various media that can store program code, such as a USB flash drive, read-only memory (ROM, Read-Only Memory), random access memory (RAM, Random Access Memory), a removable hard disk, a magnetic disk, or an optical disc.

The above are only preferred embodiments of the present invention. It should be noted that a person of ordinary skill in the art may make various improvements and modifications without departing from the principle of the present invention, and these improvements and modifications shall also be regarded as falling within the scope of protection of the present invention.

Claims

1. A web page crawling method, characterized by comprising:

crawling sub-page URLs under the root URL of a target website;

judging the sub-page URL using a first set and a second set to obtain a first judgment result, wherein the first set is used to judge whether the web page corresponding to the sub-page URL is a catalogue page, and the second set is used to judge whether the web page corresponding to the sub-page URL is a content page;

judging the sub-page URL using a predetermined class to obtain a second judgment result, wherein the second judgment result indicates whether the sub-page URL is the catalogue page or the content page;

if the first judgment result is consistent with the second judgment result, crawling the sub-page URL according to the first judgment result;

if the first judgment result is inconsistent with the second judgment result, recording the sub-page URL.

2. The method according to claim 1, characterized in that judging the sub-page URL using the first set and the second set to obtain the first judgment result comprises:

judging whether the sub-page URL is intercepted by the first set or by the second set;

if the sub-page URL is intercepted by the first set, determining that the web page corresponding to the sub-page URL is the catalogue page;

if the sub-page URL is intercepted by the second set, determining that the web page corresponding to the sub-page URL is the content page.

3. The method according to claim 1, characterized in that, after recording the sub-page URL, the method further comprises:

crawling the sub-page URL according to the first judgment result.

4. The method according to claim 1, characterized in that, after recording the sub-page URL, the method further comprises:

adding a logic judgment rule to the predetermined class, wherein the logic judgment rule is used to judge whether the sub-page URL is a content page or a catalogue page;

using the logic judgment rule in the predetermined class to judge the sub-page URL as the page type indicated by the first judgment result, the page type being content page or catalogue page.

5. The method according to claim 4, characterized in that adding the logic judgment rule to the predetermined class comprises:

adding the logic judgment rule for the sub-page URL to the predetermined class by means of a character string template;

if the logic judgment rule cannot be added for the sub-page URL by means of the character string template, saving the information of the sub-page URL.

6. The method according to claim 1, characterized in that, before judging the sub-page URL using the first set and the second set to obtain the first judgment result, the method comprises:

converting relative sub-page URLs into absolute sub-page URLs, wherein a relative sub-page is a web page having the full domain name of the sub-page, and an absolute sub-page refers to the sub-page after the root is removed;

removing sub-page URLs whose structure is similar to that of an absolute sub-page URL;

removing sub-page URLs that do not meet the requirements.

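7. The method according to claim 6, characterized in that removing sub-page URLs that do not meet the requirements comprises:

removing sub-page URLs with a different domain name and/or filtering out sub-page URLs that cannot be crawled.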

8. A web page crawling device, characterized by comprising:

a first crawling unit, configured to crawl sub-page URLs under the root URL of a target website;

a first judging unit, configured to judge the sub-page URL using a first set and a second set to obtain a first judgment result, wherein the first set is used to judge whether the web page corresponding to the sub-page URL is a catalogue page, and the second set is used to judge whether the web page corresponding to the sub-page URL is a content page;

a second judging unit, configured to judge the sub-page URL using a predetermined class to obtain a second judgment result, wherein the second judgment result indicates whether the sub-page URL is the catalogue page or the content page;

a second crawling unit, configured to continue crawling the sub-page URL according to the first judgment result if the first judgment result is consistent with the second judgment result;

a recording unit, configured to record the sub-page URL if the first judgment result is inconsistent with the second judgment result.

9. The device according to claim 8, characterized in that the first judging unit comprises:

a first judging module, configured to judge whether the sub-page URL is intercepted by the first set or by the second set;

a first determining module, configured to determine that the web page corresponding to the sub-page URL is the catalogue page if the sub-page URL is intercepted by the first set;

a second determining module, configured to determine that the web page corresponding to the sub-page URL is the content page if the sub-page URL is intercepted by the second set.

10. The device according to claim 9, characterized in that the device further comprises:

an execution unit, configured to crawl the sub-page URL according to the first judgment result after the sub-page URL is recorded.