CN107045507B

CN107045507B - Webpage crawling method and device

Info

Publication number: CN107045507B
Application number: CN201610082183.5A
Authority: CN
Inventors: 李可欣
Original assignee: Beijing Gridsum Technology Co Ltd
Current assignee: Beijing Gridsum Technology Co Ltd
Priority date: 2016-02-05
Filing date: 2016-02-05
Publication date: 2020-08-21
Anticipated expiration: 2036-02-05
Also published as: CN107045507A

Abstract

The invention discloses a webpage crawling method and device, relates to the technical field of data processing, and improves the efficiency of crawling the webpages behind specific links in a page. The main technical scheme of the invention is as follows: a crawler program receives a crawler task, wherein the crawler task comprises a URL (uniform resource locator) of a page to be crawled; a region restriction rule corresponding to the URL matching rule that the URL successfully matches is acquired from a preset rule table, wherein a plurality of URL matching rules are stored in the preset rule table, each URL matching rule corresponds to at least one region restriction rule, and the region restriction rule restricts which links in the page corresponding to the URL the crawler program is to crawl; links matching the region restriction rule are extracted from the page corresponding to the URL; and the webpages corresponding to the extracted links are crawled. The method and the device are mainly used for crawling webpage data.

Description

Webpage crawling method and device

Technical Field

The invention relates to the technical field of data processing, in particular to a webpage crawling method and device.

Background

A crawler is a program that automatically acquires webpage content. The name is a vivid description of how the program works: starting from a specified web address, it continuously extracts the links in a webpage and follows those links to fetch further, previously unknown links, so that its fetching resembles the movement of a crawling insect.

At present, if a crawler needs to crawl only some specific links in a webpage, for example the links related to news content on the Sina homepage, an existing crawler extracts all the links in the Sina homepage, specially marks the links that belong to news content, crawls the webpage content corresponding to all the links in the homepage, and finally retrieves, from the crawled content, the webpage content corresponding to the specially marked links. Because every link in the page is crawled in order to obtain the content of only a few, the efficiency of crawling the content corresponding to specific links in a webpage is low.

Disclosure of Invention

The present invention has been made in view of the above problems, and aims to provide a web page crawling method and apparatus that overcome the above problems or at least partially solve the above problems.

In order to achieve the purpose, the invention mainly provides the following technical scheme:

In one aspect, an embodiment of the present invention provides a method for crawling a web page, where the method includes:

a crawler program receives a crawler task, wherein the crawler task comprises a URL (uniform resource locator) of a page to be crawled;

acquiring a region restriction rule corresponding to a URL matching rule successfully matched with the URL from a preset rule table, wherein a plurality of URL matching rules are stored in the preset rule table, each URL matching rule at least corresponds to one region restriction rule, and the region restriction rules are used for restricting links to be crawled in a page corresponding to the URL by the crawler program;

extracting a link matched with the region restriction rule from a page corresponding to the URL;

and crawling a webpage corresponding to the extracted link.

On the other hand, an embodiment of the present invention further provides a web page crawling apparatus, including:

the receiving unit is used for receiving a crawler task, wherein the crawler task comprises a URL (uniform resource locator) of a page to be crawled;

the acquisition unit is used for acquiring a region restriction rule corresponding to a URL matching rule successfully matched with the URL from a preset rule table, wherein a plurality of URL matching rules are stored in the preset rule table, each URL matching rule at least corresponds to one region restriction rule, and the region restriction rule is used for restricting links to be crawled in a page corresponding to the URL by the crawler;

the extracting unit is used for extracting a link matched with the region restriction rule from a page corresponding to the URL;

and the crawling unit is used for crawling the webpage corresponding to the extracted link.

Through the above technical scheme, the embodiment of the invention has at least the following advantages:

According to the webpage crawling method and device provided by the embodiment of the invention, a crawler program first receives a crawler task that comprises a URL (uniform resource locator) of a page to be crawled. A region restriction rule corresponding to the URL matching rule that the URL successfully matches is then acquired from a preset rule table, in which a plurality of URL matching rules are stored; each URL matching rule corresponds to at least one region restriction rule, and the region restriction rule restricts which links in the page corresponding to the URL the crawler program is to crawl. Links matching the region restriction rule are then extracted from the page corresponding to the URL, and finally the webpages corresponding to the extracted links are crawled. In the prior art, the links to be crawled in a webpage are specially marked, the content of all links is crawled, and the content corresponding to the specially marked links is then retrieved from it. By contrast, the embodiment of the invention crawls only the links that match the region restriction rule, so the content corresponding to all the links in the page does not need to be crawled, which improves the efficiency of crawling the webpages of specific links.

Drawings

Various other advantages and benefits will become apparent to those of ordinary skill in the art upon reading the following detailed description of the preferred embodiments. The drawings are only for purposes of illustrating the preferred embodiments and are not to be construed as limiting the invention. Also, like reference numerals are used to refer to like parts throughout the drawings. In the drawings:

Fig. 1 is a flowchart of a web page crawling method according to an embodiment of the present invention;

Fig. 2 is a flowchart of another web page crawling method according to an embodiment of the present invention;

Fig. 3 is a block diagram illustrating a web page crawling apparatus according to an embodiment of the present invention;

Fig. 4 is a block diagram of another web page crawling apparatus according to an embodiment of the present invention.

Detailed Description

Exemplary embodiments of the present disclosure will be described in more detail below with reference to the accompanying drawings. While exemplary embodiments of the present disclosure are shown in the drawings, it should be understood that the present disclosure may be embodied in various forms and should not be limited to the embodiments set forth herein. Rather, these embodiments are provided so that this disclosure will be thorough and complete, and will fully convey the scope of the disclosure to those skilled in the art.

In order to make the advantages of the technical solutions of the present invention clearer, the present invention is described in detail below with reference to the accompanying drawings and examples.

An embodiment of the present invention provides a method for crawling a web page, as shown in fig. 1, the method includes:

101. The crawler program receives a crawler task.

The crawler task comprises a URL (uniform resource locator) of a page to be crawled.

102. Acquire, from a preset rule table, the region restriction rule corresponding to the URL matching rule that the URL successfully matches.

The preset rule table stores a plurality of URL matching rules, each URL matching rule corresponds to at least one region restriction rule, and the region restriction rule restricts which links in the page corresponding to the URL the crawler program is to crawl. It should be noted that the URL matching rules stored in the preset rule table, and the region restriction rules corresponding to them, are preset according to the actual requirements of the user and are used for matching the URL in the crawler task. A URL matching rule comprises a matching type and matching content; the matching type may specifically be left matching, right matching, containment matching, regular-expression matching, and the like, and the matching content may be a character string or a regular expression. The region restriction rule may specifically be a path expression.

For example, the URL in the crawler task is http://www.sample.com/picture/123.html, and the URL matching rules in the preset rule table include the following: left match, http://www.sample.com/picture; left match, http://www.sample.com/news; left match, http://www.sample.com/weather. Matching the URL in the crawler task against the URL matching rules in the preset rule table shows that the URL successfully matches the rule: left match, http://www.sample.com/picture.
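The matching step in this example can be sketched in Python as follows. This is a minimal illustration, not the patented implementation: the names `UrlMatchRule`, `url_matches`, and `first_matching_rule`, and the string labels used for the matching types, are assumptions introduced here, and the source does not say which rule wins when several match (this sketch simply returns the first).

```python
import re
from typing import NamedTuple, Optional, Sequence


class UrlMatchRule(NamedTuple):
    match_type: str  # "left", "right", "contains" or "regex" (labels assumed for this sketch)
    content: str     # a plain string, or a regular expression when match_type == "regex"


def url_matches(rule: UrlMatchRule, url: str) -> bool:
    """Return True if `url` satisfies `rule`."""
    if rule.match_type == "left":
        return url.startswith(rule.content)
    if rule.match_type == "right":
        return url.endswith(rule.content)
    if rule.match_type == "contains":
        return rule.content in url
    if rule.match_type == "regex":
        return re.search(rule.content, url) is not None
    raise ValueError(f"unknown match type: {rule.match_type!r}")


def first_matching_rule(rules: Sequence[UrlMatchRule], url: str) -> Optional[UrlMatchRule]:
    """Return the first rule that matches `url`, or None if no rule matches."""
    for rule in rules:
        if url_matches(rule, url):
            return rule
    return None


# The three left-match rules and the task URL from the example above.
rules = [
    UrlMatchRule("left", "http://www.sample.com/picture"),
    UrlMatchRule("left", "http://www.sample.com/news"),
    UrlMatchRule("left", "http://www.sample.com/weather"),
]
matched = first_matching_rule(rules, "http://www.sample.com/picture/123.html")
```

Here `matched` is the rule whose content is http://www.sample.com/picture, because the task URL starts with that string.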

103. Extract the links matching the region restriction rule from the page corresponding to the URL.

For the embodiment of the invention, after the region restriction rule corresponding to the URL matching rule that the URL successfully matches is obtained from the preset rule table, the links matching that region restriction rule are extracted from the page corresponding to the URL. The region restriction rule may specifically be a path expression, or may take the form of a combination of a matching type and matching content, which is not specifically limited in the embodiment of the present invention.

For example, the crawler task URL is http://news.sina.com.cn/c/nd/?qq-pf-to=pcqq.c2c, and the URL matching rule obtained from the preset rule table that matches the URL of the crawler task is: left match, http://news.sina.com.cn. The region restriction rule corresponding to this URL matching rule in the preset rule table is: left match, http://blog. Extracting the links matching the region restriction rule from the page corresponding to the URL then means extracting the links that left-match the expression http://blog.
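As a rough sketch of this extraction step, the following Python code collects the links in a page and keeps only those that left-match a prefix. It covers only the "matching type plus matching content" form of a region restriction rule; a rule given as a path expression would need an XPath-capable parser, which is not shown. The class `LinkCollector`, the function `extract_restricted_links`, the sample page, and the example.com hosts are all hypothetical.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkCollector(HTMLParser):
    """Collect the absolute target of every <a href="..."> in a page."""

    def __init__(self, base_url: str):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        for name, value in attrs:
            if name == "href" and value:
                # Resolve relative links against the URL of the page.
                self.links.append(urljoin(self.base_url, value))


def extract_restricted_links(html: str, base_url: str, prefix: str):
    """Return the links in `html` that left-match `prefix`."""
    collector = LinkCollector(base_url)
    collector.feed(html)
    collector.close()
    return [link for link in collector.links if link.startswith(prefix)]


# Hypothetical page content and hosts, used only for illustration.
page = """
<a href="http://blog.example.com/post/1">blog post</a>
<a href="http://sports.example.com/match/7">sports</a>
<a href="/c/nd/2.html">relative news link</a>
"""
links = extract_restricted_links(page, "http://news.example.com/", "http://blog.")
```

Only the first link survives the filter; the other two are never handed to the crawler, which is the saving the method aims for.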

104. Crawl the webpages corresponding to the extracted links.

In the embodiment of the invention, after the crawler program receives a crawler task, the region restriction rule corresponding to the URL matching rule that the current URL successfully matches is first obtained from the preset rule table, the links matching the region restriction rule are then extracted from the page corresponding to the URL, and finally the webpages corresponding to the extracted links are crawled.

An embodiment of the present invention provides another web page crawling method, as shown in fig. 2, the method includes:

201. The crawler program receives a crawler task.

202. Determine whether the region-restricted crawling function is set for the crawler task.

203. If so, extract the domain name of the URL.

For example, the crawler task URL is http://www.sample.com/123.html, and the extracted domain name is www.sample.com.
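Domain extraction of this kind can be sketched with Python's standard `urllib.parse` module. The function name `extract_domain` is an assumption; the source does not say how ports, letter case, or malformed URLs are handled, so the choices below (drop the port, lower-case the host, raise on a missing host) are illustrative only.

```python
from urllib.parse import urlsplit


def extract_domain(url: str) -> str:
    """Return the host part of `url`, lower-cased and without port or credentials."""
    host = urlsplit(url).hostname
    if host is None:
        raise ValueError(f"URL has no host: {url!r}")
    return host


# The task URL from the example above.
domain = extract_domain("http://www.sample.com/123.html")
```

With the example URL, `domain` is `www.sample.com`, matching the value given in the text.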

204. Acquire the domain name matching the domain name of the URL from the preset rule table.

It should be noted that, because the URLs in crawler tasks are diverse while the web pages under the same domain name basically share one style, the embodiment of the present invention uses the domain name as a primary index. If the domain name were not used as the primary index, every webpage needing region restriction would have to be matched against all URL matching rules, which would inevitably waste the resources of the crawler system and reduce its running speed. The URL matching rules are therefore classified by the domain name to which they belong. When a URL in a crawler task is to be crawled with region restriction, all URL rule items under the corresponding domain name index can be found in the preset rule table through the extracted domain name, and the URL of the current crawler task is then matched only against those rule items. Because not all URL matching rules need to be tried, the speed of crawling data is improved.

For the embodiment of the present invention, the method further includes: configuring the data in the preset rule table, wherein the preset rule table stores a plurality of domain names, each domain name corresponds to at least one URL matching rule, and each URL matching rule corresponds to at least one region restriction rule. The region restriction rule restricts which links in the page corresponding to the URL the crawler program is to crawl. One URL matching rule may correspond to multiple region restriction rules, and there may also be multiple URL matching rules under the same domain name. The rule pattern corresponding to a domain name is expressed in JSON format, where Domain1 and Domain2 denote domain names, URL matching 1 denotes a URL matching rule, and XPath1 and XPath2 denote the region restriction rules under URL matching 1.
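The JSON listing itself is not reproduced in this text, so the layout below is only one plausible arrangement of the placeholders named above (Domain1, Domain2, URL matching 1, XPath1, XPath2), not the layout from the patent. The nesting (domain, then URL matching rule, then a list of region restriction rules) and the helper `validate_rule_table` are assumptions made for illustration.

```python
import json

# Assumed layout: domain -> URL matching rule -> list of region restriction rules.
RULE_TABLE_JSON = """
{
  "Domain1": {
    "URL matching 1": ["XPath1", "XPath2"]
  },
  "Domain2": {
    "URL matching 1": ["XPath1"]
  }
}
"""

rule_table = json.loads(RULE_TABLE_JSON)


def validate_rule_table(table: dict) -> None:
    """Check the two "at least one" constraints stated in the text."""
    for domain, url_rules in table.items():
        if not url_rules:
            raise ValueError(f"domain {domain!r} has no URL matching rule")
        for url_rule, region_rules in url_rules.items():
            if not region_rules:
                raise ValueError(
                    f"URL matching rule {url_rule!r} under {domain!r} "
                    "has no region restriction rule"
                )


validate_rule_table(rule_table)
```

Keying the outer object by domain name is what lets the domain serve as the primary index: a single dictionary lookup narrows the search to the rules of one domain.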

205. Acquire, from the URL matching rules corresponding to the acquired domain name, the region restriction rule corresponding to the URL matching rule that the URL successfully matches.
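Steps 203 to 205 taken together amount to a two-level lookup, which might be sketched as below. The table contents, the XPath strings, and the function name `region_rules_for` are hypothetical, and only left matching is handled to keep the sketch short.

```python
from urllib.parse import urlsplit

# Hypothetical table: domain -> list of (match type, match content, region restriction rules).
RULE_TABLE = {
    "www.sample.com": [
        ("left", "http://www.sample.com/picture", ["//div[@id='gallery']//a"]),
        ("left", "http://www.sample.com/news",
         ["//div[@class='headlines']//a", "//ul[@id='latest']//a"]),
    ],
    "www.other.example": [
        ("left", "http://www.other.example/", ["//a"]),
    ],
}


def region_rules_for(url: str, table=RULE_TABLE):
    """Return the region restriction rules of the first URL rule that `url` matches.

    Only the URL rules filed under the URL's own domain are examined, so rules
    that belong to other domains are never tried.
    """
    domain = urlsplit(url).hostname
    for match_type, content, region_rules in table.get(domain, []):
        if match_type == "left" and url.startswith(content):
            return region_rules
    return []


news_rules = region_rules_for("http://www.sample.com/news/1.html")
```

A URL whose domain is absent from the table yields an empty list without any URL matching rule being evaluated.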

206. Deduplicate the extracted links.

In the embodiment of the invention, one URL matching rule may correspond to a plurality of region restriction rules. When it does, the same link may be extracted more than once, so the extracted links need to be deduplicated to ensure that no link is repeated, which prevents the crawler from crawling the same webpage data repeatedly.
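One simple way to deduplicate while keeping the order in which links were first seen is shown below. The source does not specify how two links are judged equal, so exact string equality is assumed here; the function name and the sample links are illustrative.

```python
def deduplicate(links):
    """Drop repeated links while keeping the order of first appearance."""
    # dict preserves insertion order, so the keys are the unique links in order.
    return list(dict.fromkeys(links))


# Hypothetical links extracted under two region restriction rules of one URL matching rule.
from_rule_1 = ["http://www.sample.com/news/1.html", "http://www.sample.com/news/2.html"]
from_rule_2 = ["http://www.sample.com/news/2.html", "http://www.sample.com/news/3.html"]
unique_links = deduplicate(from_rule_1 + from_rule_2)
```

The link ending in 2.html appears under both rules but is kept only once, so the crawler fetches three pages instead of four.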

207. Crawl the webpages corresponding to the deduplicated links.

For the embodiment of the invention, after the crawler program receives a crawler task, it first determines whether the region-restricted crawling function is set for the crawler task. If it is, the domain name of the URL in the crawler task is extracted, the domain name matching it is obtained from the preset rule table, and the region restriction rule corresponding to the URL matching rule that the URL successfully matches is obtained from the URL matching rules under that domain name. Finally, the extracted links are deduplicated and the webpages corresponding to the deduplicated links are crawled. In this way, the web page content corresponding to specific links in a webpage is crawled without crawling the content corresponding to all the links in the webpage: only the links that satisfy the rules in the preset rule table need to be crawled, which improves the efficiency of crawling the webpages of specific links.

According to the webpage crawling method provided by the embodiment of the invention, a crawler program first receives a crawler task that comprises a URL (uniform resource locator) of a page to be crawled. A region restriction rule corresponding to the URL matching rule that the URL successfully matches is then acquired from a preset rule table, in which a plurality of URL matching rules are stored; each URL matching rule corresponds to at least one region restriction rule, and the region restriction rule restricts which links in the page corresponding to the URL the crawler program is to crawl. Links matching the region restriction rule are then extracted from the page corresponding to the URL, and finally the webpages corresponding to the extracted links are crawled. In the prior art, the links to be crawled in a webpage are specially marked, the content of all links is crawled, and the content corresponding to the specially marked links is then retrieved from it. By contrast, the embodiment of the invention crawls only the links that match the region restriction rule, so the content corresponding to all the links in the page does not need to be crawled, which improves the efficiency of crawling the webpages of specific links.

Further, an embodiment of the present invention provides a web page crawling apparatus, as shown in fig. 3, the apparatus includes: a receiving unit 31, an obtaining unit 32, an extracting unit 33, and a crawling unit 34.

The receiving unit 31 is configured to receive a crawler task, where the crawler task includes a URL of a page to be crawled;

an obtaining unit 32, configured to obtain, from a preset rule table, the region restriction rule corresponding to the URL matching rule that the URL successfully matches, where the preset rule table stores multiple URL matching rules, each URL matching rule corresponds to at least one region restriction rule, and the region restriction rule is used to restrict the links to be crawled by the crawler program in the page corresponding to the URL;

an extracting unit 33, configured to extract a link matching the region restriction rule from a page corresponding to the URL;

and the crawling unit 34 is used for crawling the webpage corresponding to the extracted link.

It should be noted that, for other corresponding descriptions of the functional units related to the web page crawling apparatus provided in the embodiment of the present invention, reference may be made to corresponding descriptions of the method shown in fig. 1, which are not described herein again, but it should be clear that the apparatus in the embodiment can correspondingly implement all the contents in the foregoing method embodiments.

Further, an embodiment of the present invention provides another web page crawling apparatus, as shown in fig. 4, the apparatus includes: a receiving unit 41, an obtaining unit 42, an extracting unit 43, and a crawling unit 44.

A receiving unit 41, configured to receive a crawler task, where the crawler task includes a URL of a page to be crawled;

an obtaining unit 42, configured to obtain, from a preset rule table, the region restriction rule corresponding to the URL matching rule that the URL successfully matches, where the preset rule table stores multiple URL matching rules, each URL matching rule corresponds to at least one region restriction rule, and the region restriction rule is used to restrict the links to be crawled by the crawler program in the page corresponding to the URL;

an extracting unit 43, configured to extract a link matching the region restriction rule from a page corresponding to the URL;

and the crawling unit 44 is used for crawling the webpage corresponding to the extracted link.

For the embodiment of the invention, a plurality of domain names are stored in the preset rule table, each domain name at least corresponds to one URL matching rule,

the extracting unit 43 is further configured to extract a domain name of the URL;

the obtaining unit 42 is further configured to obtain a domain name matched with the domain name of the URL from the preset rule table;

the obtaining unit 42 is specifically configured to obtain, from the URL matching rules corresponding to the obtained domain name, the region restriction rule corresponding to the URL matching rule that the URL successfully matches.

In an embodiment of the present invention, the apparatus further includes: a judgment unit 45;

the judgment unit 45 is configured to judge whether the region-restricted crawling function is set for the crawler task;

the extracting unit 43 is specifically configured to extract the domain name of the URL if the region-restricted crawling function is set for the crawler task.

In an embodiment of the present invention, the apparatus further includes: a deduplication unit 46;

the deduplication unit 46 is configured to deduplicate the extracted links;

the crawling unit 44 is specifically configured to crawl the webpages corresponding to the deduplicated links.

It should be noted that, for other corresponding descriptions of functional units related to a web page crawling apparatus provided in the embodiment of the present invention, reference may be made to corresponding descriptions of the method shown in fig. 2, which are not described herein again, but it should be clear that the apparatus in the embodiment can correspondingly implement all contents in the foregoing method embodiments.

According to the webpage crawling device provided by the embodiment of the invention, a crawler program first receives a crawler task that comprises a URL of a page to be crawled. A region restriction rule corresponding to the URL matching rule that the URL successfully matches is then acquired from a preset rule table, in which a plurality of URL matching rules are stored; each URL matching rule corresponds to at least one region restriction rule, and the region restriction rule restricts which links in the page corresponding to the URL the crawler program is to crawl. Links matching the region restriction rule are then extracted from the page corresponding to the URL, and finally the webpages corresponding to the extracted links are crawled. In the prior art, the links to be crawled in a webpage are specially marked, the content of all links is crawled, and the content corresponding to the specially marked links is then retrieved from it. By contrast, the embodiment of the invention crawls only the links that match the region restriction rule, so the content corresponding to all the links in the page does not need to be crawled, which improves the efficiency of crawling the webpages of specific links.

The webpage crawling device comprises a processor and a memory, the receiving unit, the acquiring unit, the extracting unit, the crawling unit, the judging unit, the de-duplication unit and the like are all stored in the memory as program units, and the processor executes the program units stored in the memory to realize corresponding functions.

The processor comprises a kernel, and the kernel calls the corresponding program unit from the memory. One or more kernels may be provided, and the efficiency of crawling the content of specific links in a webpage is improved by adjusting kernel parameters.

The memory may include volatile memory in a computer readable medium, Random Access Memory (RAM) and/or nonvolatile memory such as Read Only Memory (ROM) or flash memory (flash RAM), and the memory includes at least one memory chip.

The present application further provides a computer program product which, when executed on a data processing device, is adapted to execute program code that initializes the following method steps: a crawler program receives a crawler task, wherein the crawler task comprises a URL (uniform resource locator) of a page to be crawled; a region restriction rule corresponding to the URL matching rule that the URL successfully matches is acquired from a preset rule table, wherein a plurality of URL matching rules are stored in the preset rule table, each URL matching rule corresponds to at least one region restriction rule, and the region restriction rule restricts which links in the page corresponding to the URL the crawler program is to crawl; links matching the region restriction rule are extracted from the page corresponding to the URL; and the webpages corresponding to the extracted links are crawled.

As will be appreciated by one skilled in the art, embodiments of the present application may be provided as a method, system, or computer program product. Accordingly, the present application may take the form of an entirely hardware embodiment, an entirely software embodiment or an embodiment combining software and hardware aspects. Furthermore, the present application may take the form of a computer program product embodied on one or more computer-usable storage media (including, but not limited to, disk storage, CD-ROM, optical storage, and the like) having computer-usable program code embodied therein.

The present application is described with reference to flowchart illustrations and/or block diagrams of methods, apparatus (systems), and computer program products according to embodiments of the application. It will be understood that each flow and/or block of the flow diagrams and/or block diagrams, and combinations of flows and/or blocks in the flow diagrams and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general purpose computer, special purpose computer, embedded processor, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, create means for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.

These computer program instructions may also be stored in a computer-readable memory that can direct a computer or other programmable data processing apparatus to function in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including instruction means which implement the function specified in the flowchart flow or flows and/or block diagram block or blocks.

These computer program instructions may also be loaded onto a computer or other programmable data processing apparatus to cause a series of operational steps to be performed on the computer or other programmable apparatus to produce a computer implemented process such that the instructions which execute on the computer or other programmable apparatus provide steps for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.

In a typical configuration, a computing device includes one or more processors (CPUs), input/output interfaces, network interfaces, and memory.

The memory may include forms of volatile memory in a computer readable medium, Random Access Memory (RAM) and/or non-volatile memory, such as Read Only Memory (ROM) or flash memory (flash RAM). The memory is an example of a computer-readable medium.

Computer-readable media, including both permanent and non-permanent, removable and non-removable media, may implement information storage by any method or technology. The information may be computer readable instructions, data structures, modules of a program, or other data. Examples of computer storage media include, but are not limited to, phase change memory (PRAM), Static Random Access Memory (SRAM), Dynamic Random Access Memory (DRAM), other types of Random Access Memory (RAM), Read Only Memory (ROM), Electrically Erasable Programmable Read Only Memory (EEPROM), flash memory or other memory technology, compact disc read only memory (CD-ROM), Digital Versatile Discs (DVD) or other optical storage, magnetic cassettes, magnetic tape or magnetic disk storage or other magnetic storage devices, or any other non-transmission medium that can be used to store information that can be accessed by a computing device. As defined herein, computer readable media do not include transitory computer readable media such as modulated data signals and carrier waves.

The above are merely examples of the present application and are not intended to limit the present application. Various modifications and changes may occur to those skilled in the art. Any modification, equivalent replacement, improvement, etc. made within the spirit and principle of the present application should be included in the scope of the claims of the present application.

Claims

1. A method for crawling a web page, comprising:

obtaining a region restriction rule corresponding to a URL matching rule successfully matched with the URL from a preset rule table, wherein a plurality of domain names are stored in the preset rule table, each domain name at least corresponds to one URL matching rule, the URL matching rule comprises a matching type and a matching content, the matching type is at least left matching and right matching, the matching content can be a character string or a regular expression, each URL matching rule at least corresponds to one region restriction rule, the region restriction rule is used for restricting links to be crawled in a page corresponding to the URL by the crawler program, and the region restriction rule can be a path expression;

the obtaining of the area restriction rule corresponding to the URL matching rule that the URL matching is successful from the preset rule table specifically includes: extracting the domain name of the URL in the crawler task, acquiring the domain name matched with the domain name of the URL from the preset rule table, and acquiring a region restriction rule corresponding to the URL rule successfully matched with the URL from the URL matching rule corresponding to the acquired domain name;

and crawling a webpage corresponding to the extracted link.

2. The method according to claim 1, wherein a plurality of domain names are further stored in the preset rule table, each domain name corresponds to at least one URL matching rule, and before the region restriction rule corresponding to the URL rule that the URL matching is successful is obtained from the preset rule table, the method further comprises:

extracting the domain name of the URL;

acquiring a domain name matched with the domain name of the URL from the preset rule table;

the obtaining of the area restriction rule corresponding to the URL rule successfully matched with the URL from the preset rule table includes:

and acquiring a region restriction rule corresponding to the URL rule successfully matched with the URL from the URL matching rule corresponding to the acquired domain name.

3. The method of claim 2, wherein after the crawler program receives the crawler task, the method further comprises:

judging whether the crawler task is provided with an area crawling limiting function or not;

the extracting the domain name of the URL comprises:

and if the crawler task sets a region crawling limit function, extracting the domain name of the URL.

4. The method according to any one of claims 1 to 3, wherein after extracting the link matching the region restriction rule from the page corresponding to the URL, the method further comprises:

carrying out de-duplication processing on the extracted link;

the crawling of the webpage corresponding to the extracted link comprises the following steps:

and crawling a webpage corresponding to the duplicate-removed link.

5. A web page crawling apparatus, comprising:

an obtaining unit, configured to obtain, from a preset rule table, a region restriction rule corresponding to a URL matching rule that the URL matching is successful, where the preset rule table stores a plurality of domain names, each domain name corresponds to at least one URL matching rule, the URL matching rule includes a matching type and matching content, the matching type is at least left matching and right matching, the matching content may be a character string or a regular expression, each URL matching rule corresponds to at least one region restriction rule, the region restriction rule is used to restrict a link to be crawled by the crawler in a page corresponding to the URL, and the region restriction rule may be a path expression;

wherein, the obtaining unit is further specifically configured to: extracting the domain name of the URL in the crawler task, acquiring the domain name matched with the domain name of the URL from the preset rule table, and acquiring a region restriction rule corresponding to the URL rule successfully matched with the URL from the URL matching rule corresponding to the acquired domain name;

6. The apparatus of claim 5, wherein the preset rule table further stores a plurality of domain names, each domain name corresponding to at least one URL matching rule,

the extracting unit is further configured to extract the domain name of the URL;

the obtaining unit is further configured to obtain a domain name matched with the domain name of the URL from the preset rule table;

the obtaining unit is specifically configured to obtain, from the obtained URL matching rule corresponding to the domain name, an area restriction rule corresponding to the URL rule that the URL matching is successful.

7. The apparatus of claim 6, further comprising: a judgment unit;

the judgment unit is used for judging whether the crawler task is provided with an area crawling limit function or not;

the extracting unit is specifically configured to extract the domain name of the URL if the crawler task sets a region crawling limit function.

8. The apparatus of any of claims 5-7, further comprising: a deduplication unit;

the deduplication unit is used for carrying out deduplication processing on the extracted links;

and the crawling unit is specifically used for crawling the webpage corresponding to the duplicate-removed link.

9. A storage medium, comprising a stored program, wherein when the program runs, a device on which the storage medium is located is controlled to execute the web page crawling method according to any one of claims 1 to 4.

10. A processor, configured to execute a program, wherein the program executes the web page crawling method according to any one of claims 1 to 4.