CN107045507B - Webpage crawling method and device - Google Patents

Info

Publication number
CN107045507B
Authority
CN
China
Prior art keywords
url
rule
matching
crawling
domain name
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201610082183.5A
Other languages
Chinese (zh)
Other versions
CN107045507A (en)
Inventor
李可欣
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Gridsum Technology Co Ltd
Original Assignee
Beijing Gridsum Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Gridsum Technology Co Ltd filed Critical Beijing Gridsum Technology Co Ltd
Priority to CN201610082183.5A priority Critical patent/CN107045507B/en
Publication of CN107045507A publication Critical patent/CN107045507A/en
Application granted granted Critical
Publication of CN107045507B publication Critical patent/CN107045507B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90 Details of database functions independent of the retrieved data types
    • G06F16/95 Retrieval from the web
    • G06F16/951 Indexing; Web crawling techniques

Landscapes

  • Engineering & Computer Science (AREA)
  • Databases & Information Systems (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Information Transfer Between Computers (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention discloses a webpage crawling method and device, relates to the technical field of data processing, and improves crawling efficiency of a specific link webpage. The main technical scheme of the invention is as follows: a crawler program receives a crawler task, wherein the crawler task comprises a URL (uniform resource locator) of a page to be crawled; acquiring a region restriction rule corresponding to a URL matching rule successfully matched with the URL from a preset rule table, wherein a plurality of URL matching rules are stored in the preset rule table, each URL matching rule at least corresponds to one region restriction rule, and the region restriction rules are used for restricting links to be crawled in a page corresponding to the URL by the crawler program; extracting a link matched with the region restriction rule from a page corresponding to the URL; and crawling a webpage corresponding to the extracted link. The method and the device are mainly used for crawling the webpage data.

Description

Webpage crawling method and device
Technical Field
The invention relates to the technical field of data processing, in particular to a webpage crawling method and device.
Background
A crawler (web crawler) is a program that automatically acquires web page content. Starting from a specified web address, it repeatedly extracts the links in a web page and follows those links to fetch deeper, previously unknown links; because this process resembles crawling, the program is called a crawler.
At present, if a crawler needs to crawl only certain links in a web page, for example the links related to news content on the Sina homepage, an existing crawler extracts all the links on the Sina homepage and specially marks those that belong to news content. It then crawls the web page content corresponding to every link on the homepage and finally retrieves, from that content, the pages corresponding to the marked links. Because all links are crawled in order to obtain only a few, the efficiency of crawling the content corresponding to specific links in a web page is low.
Disclosure of Invention
The present invention has been made in view of the above problems, and aims to provide a web page crawling method and apparatus that overcome the above problems or at least partially solve the above problems.
In order to achieve the purpose, the invention mainly provides the following technical scheme:
in one aspect, an embodiment of the present invention provides a method for crawling a web page, where the method includes:
a crawler program receives a crawler task, wherein the crawler task comprises a URL (uniform resource locator) of a page to be crawled;
acquiring a region restriction rule corresponding to a URL matching rule successfully matched with the URL from a preset rule table, wherein a plurality of URL matching rules are stored in the preset rule table, each URL matching rule at least corresponds to one region restriction rule, and the region restriction rules are used for restricting links to be crawled in a page corresponding to the URL by the crawler program;
extracting a link matched with the region restriction rule from a page corresponding to the URL;
and crawling a webpage corresponding to the extracted link.
On the other hand, an embodiment of the present invention further provides a web page crawling apparatus, including:
the receiving unit is used for receiving a crawler task, wherein the crawler task comprises a URL (uniform resource locator) of a page to be crawled;
the acquisition unit is used for acquiring a region restriction rule corresponding to a URL matching rule successfully matched with the URL from a preset rule table, wherein a plurality of URL matching rules are stored in the preset rule table, each URL matching rule at least corresponds to one region restriction rule, and the region restriction rule is used for restricting links to be crawled in a page corresponding to the URL by the crawler;
the extracting unit is used for extracting a link matched with the region restriction rule from a page corresponding to the URL;
and the crawling unit is used for crawling the webpage corresponding to the extracted link.
By the technical scheme, the technical scheme provided by the embodiment of the invention at least has the following advantages:
according to the webpage crawling method and device provided by the embodiment of the invention, a crawler program first receives a crawler task that comprises a URL (uniform resource locator) of a page to be crawled. It then obtains, from a preset rule table, the region restriction rule corresponding to the URL matching rule successfully matched with the URL; the preset rule table stores a plurality of URL matching rules, each URL matching rule corresponds to at least one region restriction rule, and the region restriction rule is used to restrict the links that the crawler program crawls in the page corresponding to the URL. Next, the links matching the region restriction rule are extracted from the page corresponding to the URL, and finally the webpages corresponding to the extracted links are crawled. In the prior art, the links to be crawled in a webpage are specially marked, and the content for the marked links is retrieved from the content crawled for all links. By contrast, the embodiment of the invention crawls only the links that match the region restriction rule, so the content corresponding to all the links in the webpage does not need to be crawled, which improves the efficiency of crawling the webpages of specific links.
Drawings
Various other advantages and benefits will become apparent to those of ordinary skill in the art upon reading the following detailed description of the preferred embodiments. The drawings are only for purposes of illustrating the preferred embodiments and are not to be construed as limiting the invention. Also, like reference numerals are used to refer to like parts throughout the drawings. In the drawings:
fig. 1 is a flowchart of a web page crawling method according to an embodiment of the present invention;
fig. 2 is a flowchart of another web page crawling method according to an embodiment of the present invention;
FIG. 3 is a block diagram illustrating a web page crawling apparatus according to an embodiment of the present invention;
fig. 4 is a block diagram of another web page crawling apparatus according to an embodiment of the present invention.
Detailed Description
Exemplary embodiments of the present disclosure will be described in more detail below with reference to the accompanying drawings. While exemplary embodiments of the present disclosure are shown in the drawings, it should be understood that the present disclosure may be embodied in various forms and should not be limited to the embodiments set forth herein. Rather, these embodiments are provided so that this disclosure will be thorough and complete, and will fully convey the scope of the disclosure to those skilled in the art.
In order to make the advantages of the technical solutions of the present invention clearer, the present invention is described in detail below with reference to the accompanying drawings and examples.
An embodiment of the present invention provides a method for crawling a web page, as shown in fig. 1, the method includes:
101. the crawler program receives a crawler task.
And the crawler task comprises a URL (uniform resource locator) of a page to be crawled.
102. And acquiring the area restriction rule corresponding to the URL matching rule successfully matched with the URL from a preset rule table.
The preset rule table stores a plurality of URL matching rules, each URL matching rule corresponds to at least one region restriction rule, and the region restriction rule is used to restrict the links that the crawler program crawls in the page corresponding to the URL. It should be noted that the URL matching rules stored in the preset rule table, and the region restriction rules corresponding to them, are preset according to the actual requirements of the user and are used for matching the URL in the crawler task. A URL matching rule includes a matching type and matching content. The matching type may specifically be left matching, right matching, containment matching, regular-expression matching, and the like, and the matching content may be a character string or a regular expression. The region restriction rule may specifically be a path expression.
For example, the URL in the crawler task is http://www.sample.com/picture/123.html, and the URL matching rules in the preset rule table include the following: left match, http://www.sample.com/picture; left match, http://www.sample.com/news; left match, http://www.sample.com/weather. Matching the URL in the crawler task against these rules shows that it successfully matches the rule: left match, http://www.sample.com/picture.
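The matching step in this example can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the patented implementation: the function name `url_matches`, the type labels `"left"`, `"right"`, `"contains"` and `"regex"`, and the tuple layout of a rule are all invented for the example, and the `"contains"` type in particular is an assumed reading of the matching types listed above.

```python
import re

def url_matches(url, match_type, content):
    """Return True if url satisfies one (match_type, content) rule."""
    if match_type == "left":      # URL starts with the given string
        return url.startswith(content)
    if match_type == "right":     # URL ends with the given string
        return url.endswith(content)
    if match_type == "contains":  # given string appears anywhere (assumed type)
        return content in url
    if match_type == "regex":     # content is a regular expression
        return re.search(content, url) is not None
    raise ValueError(f"unknown match type: {match_type}")

# Hypothetical rules mirroring the three left-match rules in the example.
rules = [
    ("left", "http://www.sample.com/picture"),
    ("left", "http://www.sample.com/news"),
    ("left", "http://www.sample.com/weather"),
]

task_url = "http://www.sample.com/picture/123.html"
matched = [rule for rule in rules if url_matches(task_url, *rule)]
print(matched)
```

Only the picture rule is a prefix of the task URL, so it is the only rule left in `matched`.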
103. And extracting a link matched with the region restriction rule from a page corresponding to the URL.
For the embodiment of the invention, after the region restriction rule corresponding to the URL matching rule successfully matched with the URL is obtained from the preset rule table, the links matching that region restriction rule are extracted from the page corresponding to the URL. The region restriction rule may specifically be a path expression, or may take the form of a combination of a matching type and matching content, which is not specifically limited in the embodiment of the present invention.
For example, the crawler task URL is http://news.sina.com.cn/c/nd/?qq-pf-to=pcqq.c2c, and the URL matching rule obtained from the preset rule table that matches the URL of the crawler task is: left match, http://news.sina.com.cn. The region restriction rule corresponding to this URL matching rule in the preset rule table is: left match, http://blog. Links matching the region restriction rule are then extracted from the page corresponding to the URL, that is, links that left-match the path expression http://blog.
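A region restriction rule of the "left match" form can be applied to a page roughly as sketched below, using only the Python standard library. The class `LinkCollector`, the function `extract_restricted_links`, the sample HTML and the `example.com` hosts are hypothetical and chosen for illustration; they do not come from the patent.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin

class LinkCollector(HTMLParser):
    """Collect the href of every <a> tag, resolved against a base URL."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    # Resolve relative links so prefix matching is meaningful.
                    self.links.append(urljoin(self.base_url, value))

def extract_restricted_links(html, base_url, prefix):
    """Keep only the links that left-match the given prefix."""
    collector = LinkCollector(base_url)
    collector.feed(html)
    return [link for link in collector.links if link.startswith(prefix)]

# Hypothetical page content and prefix, for illustration only.
page = """
<a href="http://blog.example.com/post/1">post</a>
<a href="http://news.example.com/item/2">news</a>
<a href="/local/3">local</a>
"""
kept = extract_restricted_links(
    page, "http://news.example.com/", "http://blog.example.com"
)
print(kept)
```

Links that do not start with the prefix are dropped before any request is made, which is the point of restricting the region up front.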
104. And crawling a webpage corresponding to the extracted link.
In the embodiment of the invention, after a crawler program receives a crawler task, firstly, the area restriction rule corresponding to the URL matching rule successfully matched with the current URL is obtained from the preset rule table, then, the link matched with the area restriction rule is extracted from the page corresponding to the URL, and finally, the webpage corresponding to the extracted link is crawled.
An embodiment of the present invention provides another web page crawling method, as shown in fig. 2, the method includes:
201. the crawler program receives a crawler task.
And the crawler task comprises a URL (uniform resource locator) of a page to be crawled.
202. And judging whether the crawler task is provided with an area crawling limiting function or not.
203. If so, extracting the domain name of the URL.
For example, the crawler task URL is http://www.sample.com/123.html, and the extracted domain name is www.sample.com.
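Domain extraction of this kind can be done with the Python standard library, as in the short sketch below. The helper name `extract_domain` is an assumption made for the example.

```python
from urllib.parse import urlsplit

def extract_domain(url):
    """Return the host part of a URL, lower-cased and without any port."""
    return urlsplit(url).hostname

print(extract_domain("http://www.sample.com/123.html"))
```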
204. And acquiring the domain name matched with the domain name of the URL from the preset rule table.
It should be noted that, because the URLs in crawler tasks are diverse and the web pages under one domain name mostly share a single style, the embodiment of the present invention uses the domain name as a primary index. If the domain name were not used as the primary index, every web page that needs region restriction would have to be tested against all URL matching rules, which would waste the resources of the crawler system and reduce its running speed. The URL rules are therefore grouped by the domain name to which they belong. When a URL in a crawling task is to be crawled with region restriction, all URL rule items under the corresponding domain name index are found in the preset rule table by means of the extracted domain name, and the URL of the current crawling task is then matched only against those URL rule items. Because not all URL matching rules need to be tested, the speed of crawling data is improved.
For the embodiment of the present invention, the method further includes: configuring data in the preset rule table, wherein the preset rule table stores a plurality of domain names, each domain name corresponds to at least one URL rule, and each URL rule corresponds to at least one region restriction rule. The region restriction rule is used for restricting the links to be crawled by the crawler program in the page corresponding to the URL. One URL matching rule may correspond to multiple region restriction rules, and there may be multiple URL matching rules under the same domain name. A rule pattern corresponding to a domain name is expressed in JSON format, as shown below, where Domain1 and Domain2 denote domain names, URL matching 1 denotes a URL matching rule, and XPath1 and XPath2 denote the region restriction rules under URL matching 1.
[Figures BDA0000923211240000051 and BDA0000923211240000061: JSON rule pattern; images not reproduced in this text]
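Since the JSON figure itself is not reproduced here, the following Python sketch shows one possible shape for such a rule table and how the domain name can serve as the primary index. The key names (`type`, `content`, `regions`), the XPath strings and the function `region_rules` are assumptions for illustration and are not taken from the patent figure.

```python
import json

# Hypothetical rule table: domain -> URL matching rules -> region restriction rules.
RULE_TABLE = {
    "www.sample.com": [
        {"type": "left", "content": "http://www.sample.com/picture",
         "regions": ["//div[@id='gallery']//a", "//ul[@class='thumbs']//a"]},
        {"type": "left", "content": "http://www.sample.com/news",
         "regions": ["//div[@id='headlines']//a"]},
    ],
    "blog.sample.com": [
        {"type": "left", "content": "http://blog.sample.com/",
         "regions": ["//div[@class='post']//a"]},
    ],
}

def region_rules(table, domain, url):
    """Look up the domain first, then test only that domain's URL rules."""
    regions = []
    for rule in table.get(domain, []):
        # Only left matching is handled in this sketch.
        if rule["type"] == "left" and url.startswith(rule["content"]):
            regions.extend(rule["regions"])
    return regions

print(json.dumps(RULE_TABLE, indent=2))
print(region_rules(RULE_TABLE, "www.sample.com",
                   "http://www.sample.com/picture/123.html"))
```

Because the lookup starts from the domain key, rules filed under other domains are never tested against the task URL.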
205. And acquiring a region restriction rule corresponding to the URL rule successfully matched with the URL from the URL matching rule corresponding to the acquired domain name.
206. And carrying out deduplication processing on the extracted link.
In the embodiment of the invention, one URL matching rule may correspond to a plurality of region restriction rules. When it does, the same link may be extracted more than once, so the extracted links need to be deduplicated to ensure that no link is repeated, thereby preventing the crawler from crawling the same webpage data repeatedly.
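One simple way to deduplicate while keeping the order in which links were first extracted is sketched below. The function name `deduplicate` and the sample links are hypothetical; the patent does not specify how deduplication is performed.

```python
def deduplicate(links):
    """Remove repeated links while keeping first-seen order."""
    # dict preserves insertion order, so the first occurrence of each link wins.
    return list(dict.fromkeys(links))

# The same link extracted by two different region restriction rules.
extracted = [
    "http://www.sample.com/news/1.html",
    "http://www.sample.com/news/2.html",
    "http://www.sample.com/news/1.html",
]
unique = deduplicate(extracted)
print(unique)
```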
207. And crawling a webpage corresponding to the duplicate-removed link.
For the embodiment of the invention, after a crawler program receives a crawler task, whether a region crawling limiting function is set in the crawler task is judged firstly, if the region crawling limiting function is set, a domain name of a URL in the crawler task is extracted, then the domain name matched with the domain name of the URL is obtained from a preset rule table, then a region limiting rule corresponding to the URL rule successfully matched with the URL is obtained from a URL matching rule corresponding to the obtained domain name, finally, the extracted link is subjected to duplication removal processing, and a webpage corresponding to the duplicate-removed link is crawled. Therefore, the crawling work of the web page contents corresponding to the special links in the web page is realized by the method, the contents corresponding to all the links in the web page do not need to be crawled, and only the links meeting the rules in the preset rule table need to be crawled, so that the crawling efficiency of the web page with the special links is improved.
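Steps 203 to 207 can be put together in a compact Python sketch. It assumes the region crawling limit function is already set, represents each URL rule as a hypothetical (prefix, XPath list) pair, and parses the page with `xml.etree.ElementTree`, which only accepts well-formed markup and supports a limited XPath subset; a production crawler would likely need a more tolerant HTML parser. The names `links_to_crawl`, `TABLE`, `PAGE` and the injected `fetch` callable are all assumptions for illustration.

```python
import xml.etree.ElementTree as ET
from urllib.parse import urlsplit

def links_to_crawl(url, table, fetch):
    """Return the deduplicated links selected by the region restriction rules."""
    domain = urlsplit(url).hostname                      # step 203
    url_rules = table.get(domain, [])                    # step 204
    xpaths = [xp for prefix, xps in url_rules            # step 205
              if url.startswith(prefix) for xp in xps]
    root = ET.fromstring(fetch(url))
    links = []
    for xp in xpaths:                                    # extract matching links
        for anchor in root.findall(xp):
            href = anchor.get("href")
            if href:
                links.append(href)
    return list(dict.fromkeys(links))                    # step 206

# Hypothetical rule table and page, for illustration only.
TABLE = {
    "www.sample.com": [
        ("http://www.sample.com/",
         [".//div[@id='news']//a", ".//div[@class='top']//a"]),
    ],
}
PAGE = """<html><body>
<div id="news"><a href="http://www.sample.com/news/1.html">one</a>
<p><a href="http://www.sample.com/news/2.html">two</a></p></div>
<div id="ads"><a href="http://ads.sample.com/x">ad</a></div>
<div class="top"><a href="http://www.sample.com/news/1.html">again</a></div>
</body></html>"""

result = links_to_crawl("http://www.sample.com/index.html", TABLE, lambda u: PAGE)
print(result)
```

The returned list is what step 207 would then crawl; the link inside the `ads` block is never selected, and the link that appears under both matching regions is kept only once.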
According to the webpage crawling method provided by the embodiment of the invention, a crawler program first receives a crawler task that comprises a URL (uniform resource locator) of a page to be crawled. It then obtains, from a preset rule table, the region restriction rule corresponding to the URL matching rule successfully matched with the URL; the preset rule table stores a plurality of URL matching rules, each URL matching rule corresponds to at least one region restriction rule, and the region restriction rule is used to restrict the links that the crawler program crawls in the page corresponding to the URL. Next, the links matching the region restriction rule are extracted from the page corresponding to the URL, and finally the webpages corresponding to the extracted links are crawled. In the prior art, the links to be crawled in a webpage are specially marked, and the content for the marked links is retrieved from the content crawled for all links. By contrast, the embodiment of the invention crawls only the links that match the region restriction rule, so the content corresponding to all the links in the webpage does not need to be crawled, which improves the efficiency of crawling the webpages of specific links.
Further, an embodiment of the present invention provides a web page crawling apparatus, as shown in fig. 3, the apparatus includes: a receiving unit 31, an obtaining unit 32, an extracting unit 33, and a crawling unit 34.
The receiving unit 31 is configured to receive a crawler task, where the crawler task includes a URL of a page to be crawled;
an obtaining unit 32, configured to obtain, from a preset rule table, a region restriction rule corresponding to a URL matching rule successfully matched with the URL, where the preset rule table stores multiple URL matching rules, each URL matching rule corresponds to at least one region restriction rule, and the region restriction rule is used to restrict the links to be crawled by the crawler program in a page corresponding to the URL;
an extracting unit 33, configured to extract a link matching the region restriction rule from a page corresponding to the URL;
and the crawling unit 34 is used for crawling the webpage corresponding to the extracted link.
It should be noted that, for other corresponding descriptions of the functional units related to the web page crawling apparatus provided in the embodiment of the present invention, reference may be made to corresponding descriptions of the method shown in fig. 1, which are not described herein again, but it should be clear that the apparatus in the embodiment can correspondingly implement all the contents in the foregoing method embodiments.
Further, another apparatus for crawling a web page is provided in an embodiment of the present invention, as shown in fig. 4, the apparatus includes: a receiving unit 41, an acquiring unit 42, an extracting unit 43, and a crawling unit 44.
A receiving unit 41, configured to receive a crawler task, where the crawler task includes a URL of a page to be crawled;
an obtaining unit 42, configured to obtain, from a preset rule table, a region restriction rule corresponding to a URL matching rule successfully matched with the URL, where the preset rule table stores multiple URL matching rules, each URL matching rule corresponds to at least one region restriction rule, and the region restriction rule is used to restrict the links to be crawled by the crawler program in a page corresponding to the URL;
an extracting unit 43, configured to extract a link matching the region restriction rule from a page corresponding to the URL;
and the crawling unit 44 is used for crawling the webpage corresponding to the extracted link.
For the embodiment of the invention, a plurality of domain names are stored in the preset rule table, each domain name at least corresponds to one URL matching rule,
the extracting unit 43 is further configured to extract a domain name of the URL;
the obtaining unit 42 is further configured to obtain a domain name matched with the domain name of the URL from the preset rule table;
the obtaining unit 42 is specifically configured to obtain, from the URL matching rules corresponding to the obtained domain name, the region restriction rule corresponding to the URL rule successfully matched with the URL.
In an embodiment of the present invention, the apparatus further includes: a judging unit 45 and a deduplication unit 46;
the judging unit 45 is configured to judge whether the crawler task has the region crawling limit function set;
the extracting unit 43 is specifically configured to extract the domain name of the URL if the crawler task has the region crawling limit function set.
The deduplication unit 46 is configured to perform deduplication processing on the extracted links;
the crawling unit 44 is specifically configured to crawl a webpage corresponding to the duplicate-removed link.
It should be noted that, for other corresponding descriptions of functional units related to a web page crawling apparatus provided in the embodiment of the present invention, reference may be made to corresponding descriptions of the method shown in fig. 2, which are not described herein again, but it should be clear that the apparatus in the embodiment can correspondingly implement all contents in the foregoing method embodiments.
According to the webpage crawling device provided by the embodiment of the invention, a crawler program first receives a crawler task that comprises a URL of a page to be crawled. It then obtains, from a preset rule table, the region restriction rule corresponding to the URL matching rule successfully matched with the URL; the preset rule table stores a plurality of URL matching rules, each URL matching rule corresponds to at least one region restriction rule, and the region restriction rule is used to restrict the links that the crawler program crawls in the page corresponding to the URL. Next, the links matching the region restriction rule are extracted from the page corresponding to the URL, and finally the webpages corresponding to the extracted links are crawled. In the prior art, the links to be crawled in a webpage are specially marked, and the content for the marked links is retrieved from the content crawled for all links. By contrast, the embodiment of the invention crawls only the links that match the region restriction rule, so the content corresponding to all the links in the webpage does not need to be crawled, which improves the efficiency of crawling the webpages of specific links.
The webpage crawling device comprises a processor and a memory, the receiving unit, the acquiring unit, the extracting unit, the crawling unit, the judging unit, the de-duplication unit and the like are all stored in the memory as program units, and the processor executes the program units stored in the memory to realize corresponding functions.
The processor comprises a kernel, and the kernel calls the corresponding program unit from the memory. One or more kernels may be provided, and the crawling efficiency for specific link content in a webpage is improved by adjusting the kernel parameters.
The memory may include volatile memory in a computer readable medium, Random Access Memory (RAM) and/or nonvolatile memory such as Read Only Memory (ROM) or flash memory (flash RAM), and the memory includes at least one memory chip.
The present application further provides a computer program product adapted to perform program code for initializing the following method steps when executed on a data processing device: a crawler program receives a crawler task, wherein the crawler task comprises a URL (uniform resource locator) of a page to be crawled; acquiring a region restriction rule corresponding to a URL matching rule successfully matched with the URL from a preset rule table, wherein a plurality of URL matching rules are stored in the preset rule table, each URL matching rule at least corresponds to one region restriction rule, and the region restriction rules are used for restricting links to be crawled in a page corresponding to the URL by the crawler program; extracting a link matched with the region restriction rule from a page corresponding to the URL; and crawling a webpage corresponding to the extracted link.
As will be appreciated by one skilled in the art, embodiments of the present application may be provided as a method, system, or computer program product. Accordingly, the present application may take the form of an entirely hardware embodiment, an entirely software embodiment or an embodiment combining software and hardware aspects. Furthermore, the present application may take the form of a computer program product embodied on one or more computer-usable storage media (including, but not limited to, disk storage, CD-ROM, optical storage, and the like) having computer-usable program code embodied therein.
The present application is described with reference to flowchart illustrations and/or block diagrams of methods, apparatus (systems), and computer program products according to embodiments of the application. It will be understood that each flow and/or block of the flow diagrams and/or block diagrams, and combinations of flows and/or blocks in the flow diagrams and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general purpose computer, special purpose computer, embedded processor, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, create means for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be stored in a computer-readable memory that can direct a computer or other programmable data processing apparatus to function in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including instruction means which implement the function specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be loaded onto a computer or other programmable data processing apparatus to cause a series of operational steps to be performed on the computer or other programmable apparatus to produce a computer implemented process such that the instructions which execute on the computer or other programmable apparatus provide steps for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
In a typical configuration, a computing device includes one or more processors (CPUs), input/output interfaces, network interfaces, and memory.
The memory may include forms of volatile memory in a computer readable medium, Random Access Memory (RAM) and/or non-volatile memory, such as Read Only Memory (ROM) or flash memory (flash RAM). The memory is an example of a computer-readable medium.
Computer-readable media, including both permanent and non-permanent, removable and non-removable media, may implement information storage by any method or technology. The information may be computer readable instructions, data structures, modules of a program, or other data. Examples of computer storage media include, but are not limited to, phase change memory (PRAM), Static Random Access Memory (SRAM), Dynamic Random Access Memory (DRAM), other types of Random Access Memory (RAM), Read Only Memory (ROM), Electrically Erasable Programmable Read Only Memory (EEPROM), flash memory or other memory technology, compact disc read only memory (CD-ROM), Digital Versatile Discs (DVD) or other optical storage, magnetic cassettes, magnetic tape, magnetic disk storage or other magnetic storage devices, or any other non-transmission medium that can be used to store information that can be accessed by a computing device. As defined herein, a computer readable medium does not include a transitory computer readable medium such as a modulated data signal and a carrier wave.
The above are merely examples of the present application and are not intended to limit the present application. Various modifications and changes may occur to those skilled in the art. Any modification, equivalent replacement, improvement, etc. made within the spirit and principle of the present application should be included in the scope of the claims of the present application.

Claims (10)

1. A method for crawling a web page, comprising:
receiving, by a crawler program, a crawler task, wherein the crawler task comprises a URL (uniform resource locator) of a page to be crawled;
obtaining, from a preset rule table, a region restriction rule corresponding to a URL matching rule that the URL successfully matches, wherein a plurality of domain names are stored in the preset rule table, each domain name corresponds to at least one URL matching rule, the URL matching rule comprises a matching type and matching content, the matching type comprises at least left matching and right matching, the matching content may be a character string or a regular expression, each URL matching rule corresponds to at least one region restriction rule, the region restriction rule is used for restricting the links to be crawled by the crawler program in the page corresponding to the URL, and the region restriction rule may be a path expression;
wherein obtaining, from the preset rule table, the region restriction rule corresponding to the URL matching rule that the URL successfully matches specifically comprises: extracting the domain name of the URL in the crawler task, acquiring, from the preset rule table, the domain name matching the domain name of the URL, and acquiring, from the URL matching rules corresponding to the acquired domain name, the region restriction rule corresponding to the URL matching rule that the URL successfully matches;
extracting a link matched with the region restriction rule from a page corresponding to the URL;
and crawling a webpage corresponding to the extracted link.
2. The method according to claim 1, wherein a plurality of domain names are further stored in the preset rule table, each domain name corresponds to at least one URL matching rule, and before the region restriction rule corresponding to the URL matching rule that the URL successfully matches is obtained from the preset rule table, the method further comprises:
extracting the domain name of the URL;
acquiring, from the preset rule table, a domain name matching the domain name of the URL;
wherein obtaining, from the preset rule table, the region restriction rule corresponding to the URL matching rule that the URL successfully matches comprises:
acquiring, from the URL matching rules corresponding to the acquired domain name, the region restriction rule corresponding to the URL matching rule that the URL successfully matches.
3. The method of claim 2, wherein after the crawler program receives the crawler task, the method further comprises:
judging whether a region crawling restriction function is set for the crawler task;
wherein extracting the domain name of the URL comprises:
if the region crawling restriction function is set for the crawler task, extracting the domain name of the URL.
4. The method according to any one of claims 1 to 3, wherein after extracting the link matching the region restriction rule from the page corresponding to the URL, the method further comprises:
performing de-duplication on the extracted links;
wherein crawling the webpage corresponding to the extracted link comprises:
crawling the webpages corresponding to the de-duplicated links.
5. A web page crawling apparatus, comprising:
a receiving unit, configured to receive a crawler task, wherein the crawler task comprises a URL (uniform resource locator) of a page to be crawled;
an obtaining unit, configured to obtain, from a preset rule table, a region restriction rule corresponding to a URL matching rule that the URL successfully matches, where the preset rule table stores a plurality of domain names, each domain name corresponds to at least one URL matching rule, the URL matching rule includes a matching type and matching content, the matching type includes at least left matching and right matching, the matching content may be a character string or a regular expression, each URL matching rule corresponds to at least one region restriction rule, the region restriction rule is used to restrict the links to be crawled by the crawler program in the page corresponding to the URL, and the region restriction rule may be a path expression;
wherein the obtaining unit is further specifically configured to: extract the domain name of the URL in the crawler task, acquire, from the preset rule table, the domain name matching the domain name of the URL, and acquire, from the URL matching rules corresponding to the acquired domain name, the region restriction rule corresponding to the URL matching rule that the URL successfully matches;
the extracting unit is used for extracting a link matched with the region restriction rule from a page corresponding to the URL;
and the crawling unit is used for crawling the webpage corresponding to the extracted link.
6. The apparatus of claim 5, wherein the preset rule table further stores a plurality of domain names, each domain name corresponding to at least one URL matching rule,
the extracting unit is further configured to extract the domain name of the URL;
the obtaining unit is further configured to obtain a domain name matched with the domain name of the URL from the preset rule table;
the obtaining unit is specifically configured to obtain, from the URL matching rules corresponding to the obtained domain name, the region restriction rule corresponding to the URL matching rule that the URL successfully matches.
7. The apparatus of claim 6, further comprising: a judgment unit;
the judgment unit is used for judging whether a region crawling restriction function is set for the crawler task;
the extracting unit is specifically configured to extract the domain name of the URL if the region crawling restriction function is set for the crawler task.
8. The apparatus of any of claims 5-7, further comprising: a deduplication unit;
the deduplication unit is used for performing de-duplication on the extracted links;
and the crawling unit is specifically used for crawling the webpages corresponding to the de-duplicated links.
9. A storage medium, comprising a stored program, wherein when the program runs, a device on which the storage medium is located is controlled to execute the web page crawling method according to any one of claims 1 to 4.
10. A processor, configured to execute a program, wherein the program executes the web page crawling method according to any one of claims 1 to 4.
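
The flow recited in claims 1 to 4 can be illustrated with a minimal Python sketch. It is not the patented implementation. The names UrlRule, RULE_TABLE, find_region_rules and extract_links, the example domain, and the sample page are all hypothetical. The sketch assumes that the path expression is an XPath-style expression, that the page is well-formed XHTML (so the standard-library ElementTree parser and its limited XPath subset suffice), and that "left" and "right" matching mean matching at the start and at the end of the URL respectively.

```python
import re
import xml.etree.ElementTree as ET
from dataclasses import dataclass
from urllib.parse import urljoin, urlparse


@dataclass
class UrlRule:
    match_type: str      # "left" or "right" (labels are assumptions)
    content: str         # character string or regular expression
    is_regex: bool       # True if `content` is a regular expression
    region_paths: list   # path expressions restricting where links are taken

    def matches(self, url: str) -> bool:
        if self.match_type not in ("left", "right"):
            raise ValueError(f"unknown match type: {self.match_type}")
        if self.is_regex:
            if self.match_type == "left":
                return re.match(self.content, url) is not None
            # Right matching: the pattern must match at the end of the URL.
            return re.search(f"(?:{self.content})\\Z", url) is not None
        if self.match_type == "left":
            return url.startswith(self.content)
        return url.endswith(self.content)


# Hypothetical preset rule table: domain name -> list of URL matching rules.
RULE_TABLE = {
    "news.example.com": [
        UrlRule("left", "http://news.example.com/list/", False,
                [".//div[@id='headlines']//a"]),
    ],
}


def find_region_rules(url: str, table: dict) -> list:
    """Look up the domain first, then return the region rules of the first
    URL matching rule under that domain that the URL matches."""
    domain = urlparse(url).hostname
    for rule in table.get(domain, []):
        if rule.matches(url):
            return rule.region_paths
    return []


def extract_links(page_url: str, xhtml: str, region_paths: list) -> list:
    """Collect links inside the restricted regions, de-duplicated in order."""
    root = ET.fromstring(xhtml)
    seen, links = set(), []
    for path in region_paths:
        for anchor in root.findall(path):
            href = anchor.get("href")
            if not href:
                continue
            absolute = urljoin(page_url, href)
            if absolute not in seen:
                seen.add(absolute)
                links.append(absolute)
    return links


SAMPLE_URL = "http://news.example.com/list/today"
SAMPLE_PAGE = """<html><body>
<div id="nav"><a href="/about">About</a></div>
<div id="headlines">
<a href="/story/1">One</a>
<a href="/story/2">Two</a>
<a href="/story/1">One again</a>
</div>
</body></html>"""

if __name__ == "__main__":
    paths = find_region_rules(SAMPLE_URL, RULE_TABLE)
    for link in extract_links(SAMPLE_URL, SAMPLE_PAGE, paths):
        print(link)  # each link would then be handed to the crawler
```

In this sketch the link in the navigation block is ignored because it lies outside the restricted region, and the repeated story link is kept only once. Looking up the domain name before testing URL matching rules narrows the set of rules that must be evaluated for each task.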
CN201610082183.5A 2016-02-05 2016-02-05 Webpage crawling method and device Active CN107045507B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201610082183.5A CN107045507B (en) 2016-02-05 2016-02-05 Webpage crawling method and device

Publications (2)

Publication Number Publication Date
CN107045507A CN107045507A (en) 2017-08-15
CN107045507B true CN107045507B (en) 2020-08-21

Family

ID=59543081

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201610082183.5A Active CN107045507B (en) 2016-02-05 2016-02-05 Webpage crawling method and device

Country Status (1)

Country Link
CN (1) CN107045507B (en)

Families Citing this family (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110020054B (en) * 2017-12-21 2022-10-25 腾讯科技(深圳)有限公司 Webpage content crawling method and device, computer equipment and storage medium
CN110874434A (en) * 2018-08-31 2020-03-10 珠海格力电器股份有限公司 Webpage data acquisition method and device, storage medium and electronic equipment
CN110968756B (en) * 2018-09-29 2023-05-12 北京国双科技有限公司 Webpage crawling method and device
CN110008390A (en) * 2019-02-27 2019-07-12 深圳壹账通智能科技有限公司 Appraisal procedure, device, computer equipment and the storage medium of application program
CN112579858A (en) * 2019-09-30 2021-03-30 北京国双科技有限公司 Data crawling method and device
CN112541107A (en) * 2020-12-25 2021-03-23 天津浪淘科技股份有限公司 Page data learning and automatic acquisition method

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101561814A (en) * 2009-05-08 2009-10-21 华中科技大学 Topic crawler system based on social labels
CN102087648A (en) * 2009-12-03 2011-06-08 北京大学 Method and system for fetching news comment page
CN102591992A (en) * 2012-02-15 2012-07-18 苏州亚新丰信息技术有限公司 Webpage classification identifying system and method based on vertical search and focused crawler technology
CN103389983A (en) * 2012-05-08 2013-11-13 阿里巴巴集团控股有限公司 Webpage content grabbing method and device applied to network crawler system
CN104252530A (en) * 2014-09-10 2014-12-31 北京京东尚科信息技术有限公司 Single-computer crawler grabbing method and system

Family Cites Families (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US8020206B2 (en) * 2006-07-10 2011-09-13 Websense, Inc. System and method of analyzing web content
CN101727447A (en) * 2008-10-10 2010-06-09 浙江搜富网络技术有限公司 Generation method and device of regular expression based on URL
JP5430128B2 (en) * 2008-11-21 2014-02-26 三菱電機株式会社 URL conversion apparatus, URL conversion method, URL conversion program, and Web information collection system
CN102404281B (en) * 2010-09-09 2014-08-13 北京神州绿盟信息安全科技股份有限公司 Website scanning device and method
CN103984753B (en) * 2014-05-28 2018-02-09 北京京东尚科信息技术有限公司 A kind of web crawlers goes the extracting method and device of multiplex eigenvalue

Also Published As

Publication number Publication date
CN107045507A (en) 2017-08-15

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
CB02 Change of applicant information

Address after: 100083 No. 401, 4th Floor, Haitai Building, 229 North Fourth Ring Road, Haidian District, Beijing

Applicant after: Beijing Guoshuang Technology Co.,Ltd.

Address before: 100086 Beijing city Haidian District Shuangyushu Area No. 76 Zhichun Road cuigongfandian 8 layer A

Applicant before: Beijing Guoshuang Technology Co.,Ltd.

GR01 Patent grant