WO2018157686A1 - Webpage crawling method and apparatus - Google Patents

Webpage crawling method and apparatus Download PDF

Info

Publication number
WO2018157686A1
WO2018157686A1 (PCT/CN2018/074262)
Authority
WO
WIPO (PCT)
Prior art keywords
crawling
webpage
policy
website
crawl
Prior art date
Application number
PCT/CN2018/074262
Other languages
English (en)
French (fr)
Inventor
单长美
李玲
Original Assignee
中兴通讯股份有限公司 (ZTE Corporation)
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 中兴通讯股份有限公司 (ZTE Corporation)
Publication of WO2018157686A1

Links

Images

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor

Definitions

  • the present disclosure relates to, but is not limited to, the field of Internet technologies, and in particular to a webpage crawling method and apparatus.
  • a web crawler is typically used to collect the information in webpages by crawling website data.
  • webpage crawling methods known in the art crawl all of a website's data indiscriminately; the noise data crawled is usually more than ten times the useful data, which greatly increases the storage space requirement and also makes the user's later data extraction more difficult.
  • a webpage contains a large number of links to websites unrelated to the topic; crawling all of the links in a webpage captures a large amount of useless noise data and occupies a large amount of bandwidth, so the bandwidth requirement is high.
  • the present disclosure provides a webpage crawling method and apparatus with low storage space and bandwidth requirements.
  • An embodiment of the present disclosure provides a webpage crawling method, including the following steps:
  • the crawling task includes a target website, and the crawling policy includes a URL restriction policy;
  • the URL restriction policy includes specifying that a URL is crawled only once, or that a URL is crawled once every preset interval, or that a first URL is crawled only once and a second URL is crawled once every preset interval.
  • the crawling policy further includes a frequency limiting policy, and crawling the webpages of the target websites in the crawl list in turn includes: crawling different content in the target website's webpages at different frequencies according to the frequency limiting policy.
  • the crawling policy further includes a quantity limiting policy, and crawling the webpages of the target websites in the crawl list in turn includes: crawling a preset quantity of specified content in the target website's webpages according to the quantity limiting policy.
  • the crawling task further includes at least one of a daily task start/stop time, a task crawl depth, and a daily task cycle count and cycle interval.
  • crawling the webpages of the target websites in the crawl list in turn includes:
  • fetching the webpage information of the target website; and denoising the webpage information according to a preset parsing plug-in, and extracting and storing the useful content in the webpage information.
  • the parsing plug-in includes a generic parsing plug-in or a custom parsing plug-in obtained after secondary development of the generic parsing plug-in by the user.
  • an embodiment of the present disclosure further provides a webpage crawling apparatus, where the apparatus includes:
  • a configuration module configured to configure a crawling task and a crawling policy, the crawling task including a target website and the crawling policy including a URL restriction policy;
  • a webpage crawling module configured to generate a crawl list according to the target website, crawl the webpages of the target websites in the crawl list in turn, and obtain the website links in the webpages; and
  • a link filtering module configured to filter the website links according to the URL restriction policy to filter out the invalid links among them, and to add the remaining website links after filtering to the crawl list as links of the target website for subsequent crawling by the webpage crawling module.
  • the URL restriction policy includes specifying that a URL is crawled only once, or that a URL is crawled once every preset interval, or that a first URL is crawled only once and a second URL is crawled once every preset interval.
  • the crawling policy further includes a frequency limiting policy, and the webpage crawling module is configured to crawl different content in the target website's webpages at different frequencies according to the frequency limiting policy.
  • the crawling policy further includes a quantity limiting policy, and the webpage crawling module is configured to crawl a preset quantity of specified content in the target website's webpages according to the quantity limiting policy.
  • the crawling task further includes at least one of a daily task start/stop time, a task crawl depth, and a daily task cycle count and cycle interval.
  • the webpage crawling module includes:
  • a fetching unit configured to fetch the webpage information of the target website; and
  • a parsing unit configured to denoise the webpage information according to a preset parsing plug-in, and to extract and store the useful content in the webpage information.
  • the apparatus further includes a plug-in development module configured to receive the user's instruction for secondary development of the generic parsing plug-in and generate a custom parsing plug-in.
  • Embodiments of the present disclosure also provide a computer readable storage medium storing computer executable instructions that, when executed, implement the web page crawling method described above.
  • in the webpage crawling method of the embodiments of the present disclosure, a URL restriction policy is configured, the website links in the crawled webpages are filtered according to the URL restriction policy to filter out the invalid links among them, and the remaining website links after filtering are added to the crawl list as links of the target website for subsequent crawling.
  • the parsing plug-in is used to denoise the fetched webpage information and to extract and store the useful content in it, which greatly lowers the storage space requirement, reduces interference from noise data, and makes the user's later data extraction easier.
  • the user is allowed to perform secondary development on the generic parsing plug-in to generate a custom parsing plug-in, and the webpage information is parsed with the custom plug-in, which enables precise crawling of website data and satisfies the user's individual needs.
  • FIG. 1 is a flowchart of a webpage crawling method according to a first embodiment of the present disclosure
  • FIG. 2 is a schematic block diagram of a webpage crawling device according to a second embodiment of the present disclosure
  • FIG. 3 is a schematic block diagram of the webpage crawling module in FIG. 2;
  • FIG. 4 is a schematic block diagram of a webpage crawling device according to a third embodiment of the present disclosure.
  • FIG. 5 is a schematic diagram of interaction of multiple modules when the web crawling device of FIG. 4 performs web crawling.
  • a webpage crawling method according to a first embodiment of the present disclosure is proposed.
  • the method includes the following steps:
  • the crawling task includes a target website, and the crawling policy includes a URL restriction policy.
  • the webpage crawling apparatus can receive the user's configuration operations and configure the crawling task and the crawling policy.
  • the crawling task includes at least the target website; that is, the apparatus receives the user's setting of the website entry points to be crawled and configures the target websites to be crawled according to that setting.
  • the crawling task may further include at least one of the daily start and stop time of the task (i.e., the start time and the stop time), the task crawl depth, and the daily task cycle count and cycle interval; that is, the user may also configure parameter information such as the daily task start time, the daily task stop time, the task crawl depth, the number of task cycles per day, and the task cycle interval.
  • the crawling strategy includes at least a Uniform Resource Locator (URL) restriction policy.
  • the URL restriction policy may include specifying that a URL is crawled only once, or that a URL is crawled once every preset interval, or that one URL (herein referred to as the first URL) is crawled only once while another URL (herein referred to as the second URL) is crawled once every preset interval; for example, some URLs need not be crawled again after one crawl, while other URLs, once crawled, are not crawled again for a period of time.
  • the crawling policy may further include a frequency limiting policy, a quantity limiting policy, and the like, where the frequency limiting policy sets different crawling frequencies for different content in the webpage, and the quantity limiting policy crawls only a preset quantity of specified content in the webpage.
  • the webpage crawling apparatus can first read the target websites configured by the user, merge the URLs of the target websites, and eliminate duplicate URL entries; it can then sort the merged URLs, for example in descending (or ascending) order computed from the domain name, the link count, and a hash algorithm together, to generate a crawl list.
  • the webpage crawling apparatus may crawl the webpages of the target websites in turn according to the order of the URLs of each target website in the crawl list.
  • the webpage crawling apparatus may send a request to the target website and fetch its webpage information, which may include various webpage content such as body text, comments, and website links, and store the webpage information.
  • the web crawler can be configured with multiple threads for fetching to improve fetching efficiency.
  • for websites under the same domain name, a specific crawling strategy can be adopted to avoid the website's anti-crawling countermeasures, such as lowering the crawling frequency, lengthening the crawling cycle, or crawling from multiple machines.
  • a parsing plug-in may also be preset; the parsing plug-in may be implemented with the readabilityBUNDLE algorithm and may be used to denoise the crawled webpage information, so as to strip the webpage information down and remove invalid (or non-essential) content such as advertisements and the website background.
  • the useful content in the webpage information, such as the title, article, and comments, is extracted, and only the useful content is stored, which greatly lowers the storage space requirement, reduces interference from noise data, and makes the user's later data extraction easier.
  • the parsing plug-in parses the webpage information into structured data.
  • the storage module of the web crawler stores the parsed structured data in a file system.
  • if the data from one fetch is too large, it is split across more than one file for storage.
  • the maximum size of each file may be 10 MB (the maximum size can be modified), which facilitates the processing of the data files later.
  • the aforementioned parsing plug-in may include a factory-preset generic parsing plug-in, and may also include a custom parsing plug-in obtained after the user performs secondary development on the generic plug-in. For example, if the user has a special need to parse out specific information such as the article, author, publication time, and date, the user can edit the generic parsing plug-in to obtain a custom parsing plug-in; the web crawler can load the custom parsing plug-in, parse the webpage information as the user requires, and parse it into the structured data the user needs, so as to crawl the website data precisely according to the user's requirements.
  • the webpage crawling apparatus crawls different content in the target website's webpages at different frequencies according to the frequency limiting policy. For example, for a news site, news content can be crawled very frequently (such as once an hour), while comment content can be crawled once a day. This improves crawling efficiency on the one hand, and reduces useless noise data and lowers the storage space requirement on the other.
  • the web crawler crawls a preset quantity of specified content in the target website's webpages according to the quantity limiting policy. For example, for comment content, it is possible to crawl only a preset number of comments, or only the comments on a preset number of pages (such as the first few pages). This improves crawling efficiency on the one hand, and reduces useless noise data and lowers the storage space requirement on the other.
  • the webpage crawling apparatus may filter the website links in the currently crawled webpage according to the configured URL restriction policy, filter out the invalid links among them, and add only the remaining website links after filtering to the crawl list as links of the target website for subsequent crawling.
  • for example, for some URLs the restriction policy is to crawl only once; once such a URL has been crawled, it is filtered out and not crawled again.
  • as another example, for some URLs the restriction policy is to crawl once every preset interval; once such a URL has been crawled, it is filtered out for the preset duration, that is, it is not crawled again for a period of time.
  • the webpage crawling apparatus can also monitor the crawling tasks, for example monitoring a task's running status, including whether it is currently running, the time of the last successful run, the duration of the last successful run, the time of the last failed run, and so on, so that users can view and manage tasks in real time.
  • the webpage crawling apparatus can also manage the crawling tasks, including adding a task, deleting a task, starting a task, stopping a task, starting a task immediately, and viewing task information, so that users can manage the crawling tasks in real time.
  • by controlling the crawled external links, the webpage crawling method of the embodiments of the present disclosure effectively filters out irrelevant websites, reduces the amount of website data crawled, and targets the crawling of useful information to a greater extent, which improves crawling efficiency, reduces useless noise data, lowers the storage space requirement, and greatly reduces bandwidth usage.
  • the apparatus includes a configuration module 10, a webpage crawling module 20, and a link filtering module 30, wherein:
  • the configuration module 10 is set to configure crawling tasks and crawling policies.
  • the configuration module 10 may be configured to receive the user's configuration operations and configure the crawling task and the crawling policy.
  • the crawling task includes at least a target website; that is, the configuration module 10 may be configured to receive the user's setting of the website entry points to be crawled and configure the target websites to be crawled according to that setting.
  • the crawling task may further include at least one of a daily task start/stop time, a task crawl depth, and a daily task cycle count and cycle interval; that is, the user may also configure parameter information such as the daily task start time, the daily task stop time, the task crawl depth, the number of task cycles per day, and the task cycle interval.
  • the crawling strategy includes at least a URL restriction policy.
  • the URL restriction policy may include specifying that a URL is crawled only once, or that a URL is crawled once every preset interval, or that one URL (herein referred to as the first URL) is crawled only once while another URL (herein referred to as the second URL) is crawled once every preset interval; for example, some URLs do not need to be crawled again after one crawl, while other URLs, once crawled, are not crawled again for a period of time.
  • the crawling policy may further include a frequency limiting policy, a quantity limiting policy, and the like, where the frequency limiting policy sets different crawling frequencies for different content in the webpage, and the quantity limiting policy crawls only a preset quantity of specified content in the webpage.
  • the webpage crawling module 20 is configured to generate a crawl list according to the target website, crawl the webpages of the target websites in the crawl list in turn, and obtain the website links in the webpages.
  • the webpage crawling module 20 may include a generating unit 201 and a fetching unit 202.
  • the generating unit 201 is configured to generate a crawl list according to the target website.
  • the fetching unit 202 is configured to fetch the webpage information of the target website.
  • the generating unit 201 may be configured to read the target websites configured by the user, merge the URLs of the target websites, and eliminate duplicate URL entries, and then sort the merged URLs, for example in descending (or ascending) order computed from the domain name, the link count, and a hash algorithm together, to generate the crawl list.
  • the fetching unit 202 may be configured to crawl the webpages of the target websites in turn according to the order of the URLs of each target website in the crawl list.
  • the fetching unit 202 may be configured to send a request to the target website and fetch its webpage information, where the webpage information includes various webpage content such as body text, comments, and website links, and to store the webpage information.
  • the fetching unit 202 can be configured to use multiple threads for fetching to improve fetching efficiency; for websites under the same domain name, a specific crawling strategy can be adopted to avoid the website's anti-crawling countermeasures, such as lowering the crawling frequency, lengthening the crawling cycle, or crawling from multiple machines.
  • the webpage crawling module 20 further includes a parsing unit 203, and the parsing unit 203 is configured to denoise the webpage information according to the preset parsing plug-in, and to extract and store the useful content in the webpage information.
  • the parsing plug-in parses the webpage information into structured data.
  • the parsing plug-in can be implemented with the readabilityBUNDLE algorithm.
  • the parsing unit 203 can be configured to load the parsing plug-in and then use it to denoise the crawled webpage information, so as to strip the webpage information down and remove invalid (or non-essential) content such as advertisements and the website background.
  • only the useful content in the webpage information, such as the title, article, and comments, is extracted, and only the useful content is stored, which greatly lowers the storage space requirement, reduces interference from noise data, and makes the user's later data extraction easier.
  • the webpage crawling module 20 is further configured to crawl different content in the target website's webpages at different frequencies according to the frequency limiting policy. For example, for a news site, news content can be crawled very frequently (such as once an hour), while comment content can be crawled once a day. This improves crawling efficiency on the one hand, and reduces useless noise data and lowers the storage space requirement on the other.
  • the webpage crawling module 20 is further configured to crawl a preset quantity of specified content in the target website's webpages according to the quantity limiting policy. For example, for comment content, it is possible to crawl only a preset number of comments, or only the comments on a preset number of pages (such as the first few pages). This improves crawling efficiency on the one hand, and reduces useless noise data and lowers the storage space requirement on the other.
  • the link filtering module 30 is configured to filter the website links according to the URL restriction policy to filter out the invalid links among them, and to add the remaining website links after filtering to the crawl list as links of the target website for subsequent crawling by the webpage crawling module 20.
  • the link filtering module 30 may be configured to filter the website links in the currently crawled webpage according to the configured URL restriction policy, filter out the invalid links among them, add only the remaining website links after filtering to the crawl list as links of the target website, and update the crawl list so that the webpage crawling module 20 subsequently crawls the newly added website links.
  • for example, for some URLs the restriction policy is to crawl only once; once the webpage crawling module 20 has crawled such a URL, the link filtering module 30 filters it out so that the webpage crawling module 20 does not crawl it again.
  • as another example, for some URLs the restriction policy is to crawl once every preset interval; once the webpage crawling module 20 has crawled such a URL, the link filtering module 30 filters it out for the preset duration, that is, the webpage crawling module 20 does not crawl it again for a period of time.
  • the webpage crawling apparatus may further include a storage module, and the storage module is configured to store the parsed structured data in a file system.
  • if the data from one fetch is too large, it is split across more than one file for storage; the maximum size of each file may be 10 MB (the maximum size can be modified), which facilitates the processing of the data files later.
  • the aforementioned parsing plug-in can include a factory-preset generic parsing plug-in.
  • the apparatus may further include a plug-in development module, where the plug-in development module is configured to receive the user's instruction for secondary development of the generic parsing plug-in and generate a custom parsing plug-in.
  • for example, if the user has a special need to parse out specific information such as the article, author, publication time, and date, the user can edit the generic parsing plug-in online through the plug-in development module to obtain a custom parsing plug-in; the webpage crawling module 20 can then load the custom parsing plug-in, parse the webpage information as the user requires, and parse it into the structured data the user needs, so as to crawl the website data precisely according to the user's requirements.
  • by controlling the crawled external links, the webpage crawling apparatus of the embodiments of the present disclosure effectively filters out irrelevant websites, reduces the amount of website data crawled, and targets the crawling of useful information to a greater extent, which improves crawling efficiency, reduces useless noise data, lowers the storage space requirement, and greatly reduces bandwidth usage.
  • the apparatus includes a graphical user interface module 100, a basic support module 200, a plug-in development module 300, a crawling module 400, and a storage module 500, wherein:
  • the basic support module 200 is configured to provide the basic services for webpage crawling, including various configuration, management, and monitoring services.
  • the basic support module 200 interacts with the user; the user can operate on tasks interactively, and the system supports running multiple tasks at the same time. The entire system is managed through this module, which receives the user-configured target seeds (such as target websites) and the various crawling policies, and saves the received user configuration in a configuration file for subsequent crawling.
  • the basic support module 200 may include the configuration module 10 and a supervision module; the configuration module 10 is the same as the configuration module 10 in the second embodiment, and details are not described here again.
  • the supervision module is configured to monitor and manage the crawling tasks, wherein: for task monitoring, it monitors a task's running status, including whether the task is currently running, the time of the last successful run, the duration of the last successful run, the time of the last failed run, and so on, so that users can view and manage tasks in real time; task management includes adding tasks, deleting tasks, starting tasks, stopping tasks, starting a task immediately, viewing task information, and the like, to facilitate real-time management of the crawling tasks.
  • the graphical user interface module 100 is configured to provide a graphical display interface for the user, making graphical operation convenient, including the graphical display and operation of crawling task configuration, crawling policy configuration, task monitoring, task management, and plug-in development; interactive user operation greatly enhances ease of use.
  • the plug-in development module 300 is configured to receive the user's instruction for secondary development of the generic parsing plug-in and generate a custom parsing plug-in. Users can develop their own parsing plug-ins on the graphical interface as needed.
  • the plug-in development module 300 in this embodiment is the same as the plug-in development module in the second embodiment, and details are not described here again.
  • the crawling module 400 is configured to generate a crawl list according to the target website, crawl the webpages of the target websites in the crawl list in turn, obtain the website links in the webpages, filter the website links according to the URL restriction policy to filter out the invalid links among them, and add the remaining website links after filtering to the crawl list as links of the target website for subsequent crawling.
  • the crawling module 400 in this embodiment is equivalent to the combination of the webpage crawling module 20 and the link filtering module 30 in the second embodiment; see the webpage crawling module 20 and the link filtering module 30 in the second embodiment, which are not described here again.
  • the storage module 500 is configured to store the webpage information crawled by the crawling module.
  • when the crawling module has parsed the webpage information, the parsed structured data is stored in a file system.
  • if the data from one fetch is too large, it is split across more than one file for storage; the maximum size of each file may be 10 MB (the maximum size can be modified), which facilitates the processing of the data files later.
  • Step 101: when the user performs operations such as crawling task configuration, crawling policy configuration, and task management, the graphical user interface module sends the operation command to the basic support module, and the basic support module parses the operation command and handles it accordingly.
  • Step 102: after handling the user's operation command, the basic support module returns the operation result to the user and saves the information, such as the configuration and other operation information.
  • Step 103: when the user develops and edits a plug-in online, the graphical user interface sends the operation command to the plug-in development module, and the plug-in development module parses the operation command and handles it accordingly.
  • Step 104: the plug-in development module turns the parsing plug-in developed by the user into a custom parsing plug-in for later use in parsing webpages, saves the information, and returns the operation result to the graphical user interface for display to the user.
  • Step 105: the user sends a start-task-immediately command to the crawling module through the graphical user interface module, and the crawling module reacts accordingly.
  • Step 106: when the configured task start time arrives, the crawling module reacts accordingly.
  • Step 107: upon receiving a start-task-immediately command, or when the task start time arrives, the crawling module starts the crawling task, crawls the webpages, parses them, and adds the filtered external links to the pool of webpages to be crawled (such as the crawl list).
  • Step 108: after crawling is complete, the crawling module sends a storage command to the storage module to notify it to store the data.
  • Step 109: upon receiving the storage command, the storage module stores the structured webpage data in files, split into files according to the data size.
  • Step 110: after storage is complete, the storage module returns the crawl result to the graphical user interface, so as to inform the user through the graphical user interface that all operations are complete, and updates the task status.
  • by configuring a URL restriction policy, the webpage crawling apparatus of the embodiments of the present disclosure filters the website links in the crawled webpages according to the URL restriction policy, so as to filter out the invalid links among them, and adds the remaining website links after filtering to the crawl list as links of the target website for subsequent crawling.
  • the parsing plug-in is used to denoise the fetched webpage information and to extract and store the useful content in it, which greatly lowers the storage space requirement, reduces interference from noise data, and makes the user's later data extraction easier.
  • the user is allowed to perform secondary development on the generic parsing plug-in to generate a custom parsing plug-in, and the webpage information is parsed with the custom plug-in, which enables precise crawling of website data and satisfies the user's individual needs.
  • the webpage crawling apparatus of the embodiments of the present disclosure can be deployed either on a single machine or in a Hadoop cluster.
  • Embodiments of the present disclosure also provide a computer readable storage medium storing computer executable instructions that, when executed, implement the web page crawling method described above.
  • computer storage media include volatile and nonvolatile, removable and non-removable media implemented in any method or technology for storing information, such as computer readable instructions, data structures, program modules, or other data.
  • computer storage media include, but are not limited to, Random Access Memory (RAM), Read-Only Memory (ROM), Electrically Erasable Programmable Read-Only Memory (EEPROM), flash memory or other memory technology, Compact Disc Read-Only Memory (CD-ROM), Digital Versatile Disc (DVD) or other optical disc storage, magnetic cassettes, magnetic tape, magnetic disk storage or other magnetic storage devices, or any other medium that can be used to store the desired information and that can be accessed by a computer.
  • communication media typically contain computer readable instructions, data structures, program modules, or other data in a modulated data signal, such as a carrier wave or other transport mechanism, and may include any information delivery media.
  • in the webpage crawling method of the embodiments of the present disclosure, a URL restriction policy is configured, the website links in the crawled webpages are filtered according to the URL restriction policy to filter out the invalid links among them, and the remaining website links after filtering are added to the crawl list as links of the target website for subsequent crawling.

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Databases & Information Systems (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Information Transfer Between Computers (AREA)

Abstract

A webpage crawling method includes: configuring a crawling task and a crawling policy, the crawling task including a target website and the crawling policy including a URL restriction policy; generating a crawl list according to the target website; crawling the webpages of the target websites in the crawl list in turn and obtaining the website links in the webpages; and filtering the website links according to the URL restriction policy to filter out the invalid links among them, and adding the remaining website links after filtering to the crawl list as links of the target website for subsequent crawling.

Description

Webpage crawling method and apparatus
Technical Field
The present disclosure relates to, but is not limited to, the field of Internet technologies, and in particular to a webpage crawling method and apparatus.
Background
With the rapid development of network information technology, big data on websites is growing exponentially, and webpages have become the carriers of massive amounts of information. A web crawler is typically used to crawl website data in order to collect the information in webpages.
Summary
The following is an overview of the subject matter described in detail herein. This overview is not intended to limit the protection scope of the claims.
When webpage crawling methods known in the art crawl website data, they crawl everything; the noise data crawled is usually more than ten times the useful data, which both greatly increases the storage space requirement and makes the user's later data extraction more difficult. In particular, webpages contain a large number of website links unrelated to the topic; crawling all of the website links in a webpage both captures a large amount of useless noise data and occupies a large amount of bandwidth, so the bandwidth requirement is high.
The present disclosure provides a webpage crawling method and apparatus with low storage space and bandwidth requirements.
An embodiment of the present disclosure provides a webpage crawling method, including the following steps:
configuring a crawling task and a crawling policy, the crawling task including a target website and the crawling policy including a URL restriction policy;
generating a crawl list according to the target website;
crawling the webpages of the target websites in the crawl list in turn, and obtaining the website links in the webpages; and
filtering the website links according to the URL restriction policy to filter out the invalid links among them, and adding the remaining website links after filtering to the crawl list as links of the target website for subsequent crawling.
In an exemplary embodiment, the URL restriction policy includes specifying that a URL is crawled only once, or that a URL is crawled once every preset interval, or that a first URL is crawled only once and a second URL is crawled once every preset interval.
In an exemplary embodiment, the crawling policy further includes a frequency limiting policy, and crawling the webpages of the target websites in the crawl list in turn includes: crawling different content in the target website's webpages at different frequencies according to the frequency limiting policy.
In an exemplary embodiment, the crawling policy further includes a quantity limiting policy, and crawling the webpages of the target websites in the crawl list in turn includes: crawling a preset quantity of specified content in the target website's webpages according to the quantity limiting policy.
In an exemplary embodiment, the crawling task further includes at least one of a daily task start/stop time, a task crawl depth, and a daily task cycle count and cycle interval.
In an exemplary embodiment, crawling the webpages of the target websites in the crawl list in turn includes:
fetching the webpage information of the target website; and
denoising the webpage information according to a preset parsing plug-in, and extracting and storing the useful content in the webpage information.
In an exemplary embodiment, the parsing plug-in includes a generic parsing plug-in or a custom parsing plug-in obtained after secondary development of the generic parsing plug-in by the user.
An embodiment of the present disclosure further provides a webpage crawling apparatus, the apparatus including:
a configuration module configured to configure a crawling task and a crawling policy, the crawling task including a target website and the crawling policy including a URL restriction policy;
a webpage crawling module configured to generate a crawl list according to the target website, crawl the webpages of the target websites in the crawl list in turn, and obtain the website links in the webpages; and
a link filtering module configured to filter the website links according to the URL restriction policy to filter out the invalid links among them, and to add the remaining website links after filtering to the crawl list as links of the target website for subsequent crawling by the webpage crawling module.
In an exemplary embodiment, the URL restriction policy includes specifying that a URL is crawled only once, or that a URL is crawled once every preset interval, or that a first URL is crawled only once and a second URL is crawled once every preset interval.
In an exemplary embodiment, the crawling policy further includes a frequency limiting policy, and the webpage crawling module is configured to crawl different content in the target website's webpages at different frequencies according to the frequency limiting policy.
In an exemplary embodiment, the crawling policy further includes a quantity limiting policy, and the webpage crawling module is configured to crawl a preset quantity of specified content in the target website's webpages according to the quantity limiting policy.
In an exemplary embodiment, the crawling task further includes at least one of a daily task start/stop time, a task crawl depth, and a daily task cycle count and cycle interval.
In an exemplary embodiment, the webpage crawling module includes:
a fetching unit configured to fetch the webpage information of the target website; and
a parsing unit configured to denoise the webpage information according to a preset parsing plug-in, and to extract and store the useful content in the webpage information.
In an exemplary embodiment, the apparatus further includes a plug-in development module configured to receive the user's instruction for secondary development of the generic parsing plug-in and generate a custom parsing plug-in.
An embodiment of the present disclosure further provides a computer-readable storage medium storing computer-executable instructions which, when executed, implement the webpage crawling method described above.
In the webpage crawling method of the embodiments of the present disclosure, a URL restriction policy is configured, the website links in the crawled webpages are filtered according to the URL restriction policy to filter out the invalid links among them, and the remaining website links after filtering are added to the crawl list as links of the target website for subsequent crawling. Irrelevant websites are thus filtered out effectively, the amount of website data crawled is reduced, and crawling is targeted at useful information to a greater extent, which improves crawling efficiency, reduces useless noise data and thereby lowers the storage space requirement, and greatly reduces bandwidth usage; the requirements on storage space and bandwidth are therefore low.
At the same time, the parsing plug-in is used to denoise the fetched webpage information and to extract and store the useful content in it, which greatly lowers the storage space requirement, reduces interference from noise data, and makes the user's later data extraction easier. Moreover, the user is allowed to perform secondary development on the generic parsing plug-in to generate a custom parsing plug-in and to parse the webpage information with it, which enables precise crawling of website data and satisfies the user's individual needs.
Other aspects will become apparent upon reading and understanding the drawings and the detailed description.
Brief Description of the Drawings
FIG. 1 is a flowchart of a webpage crawling method according to a first embodiment of the present disclosure;
FIG. 2 is a schematic block diagram of a webpage crawling apparatus according to a second embodiment of the present disclosure;
FIG. 3 is a schematic block diagram of the webpage crawling module in FIG. 2;
FIG. 4 is a schematic block diagram of a webpage crawling apparatus according to a third embodiment of the present disclosure;
FIG. 5 is a schematic diagram of the interaction of multiple modules when the webpage crawling apparatus in FIG. 4 performs webpage crawling.
Preferred Embodiments of the Present Disclosure
Embodiments of the present disclosure are described below with reference to the drawings.
Embodiment 1
Referring to FIG. 1, a webpage crawling method according to a first embodiment of the present disclosure is proposed. The method includes the following steps.
S11. Configure a crawling task and a crawling policy, where the crawling task includes a target website and the crawling policy includes a URL restriction policy.
In step S11, the webpage crawling apparatus may receive the user's configuration operations and configure the crawling task and the crawling policy.
The crawling task includes at least the target website; that is, the apparatus may receive the user's setting of the website entry points to be crawled and configure the target websites to be crawled according to that setting. In addition, the crawling task may further include at least one of a daily task start/stop time (i.e., start time and stop time), a task crawl depth, and a daily task cycle count and cycle interval; that is, the user may also configure parameter information such as the daily task start time, the daily task stop time, the task crawl depth, the number of task cycles per day, and the task cycle interval.
The crawling policy includes at least a Uniform Resource Locator (URL) restriction policy. The URL restriction policy may include specifying that a URL is crawled only once, or that a URL is crawled once every preset interval, or that one URL (referred to here as the first URL) is crawled only once while another URL (referred to here as the second URL) is crawled once every preset interval. For example, some URLs need not be crawled again after being crawled once, while other URLs, once crawled, are not crawled again for a period of time. Optionally, the crawling policy may further include a frequency limiting policy, a quantity limiting policy, and the like, where the frequency limiting policy sets different crawling frequencies for different content in a webpage, and the quantity limiting policy crawls only a preset quantity of specified content in a webpage.
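For illustration only, the configuration produced in S11 could be held in plain data structures; this is a minimal Python sketch, and every field name below is a hypothetical choice, not taken from the disclosure.

```python
from dataclasses import dataclass, field
from typing import Optional

@dataclass
class CrawlTask:
    target_sites: list[str]               # entry URLs of the target websites
    daily_start: Optional[str] = None     # daily task start time, e.g. "08:00"
    daily_stop: Optional[str] = None      # daily task stop time, e.g. "20:00"
    crawl_depth: int = 3                  # task crawl depth
    cycles_per_day: int = 1               # daily task cycle count
    cycle_interval_s: int = 3600          # interval between cycles, in seconds

@dataclass
class CrawlPolicy:
    crawl_once: set[str] = field(default_factory=set)                # URLs crawled only once
    recrawl_interval_s: dict[str, int] = field(default_factory=dict) # URL -> preset interval
    frequency_limits: dict[str, int] = field(default_factory=dict)   # content type -> seconds between crawls
    quantity_limits: dict[str, int] = field(default_factory=dict)    # content type -> max items to crawl
```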
S12. Generate a crawl list according to the target website.
In step S12, the webpage crawling apparatus may first read the target websites configured by the user, merge the URLs of the target websites, and eliminate duplicate URL entries; it may then sort the merged URLs, for example in descending (or ascending) order computed from the domain name, the link count, and a hash algorithm together, to generate the crawl list.
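A minimal sketch of the merge–deduplicate–sort step. The disclosure does not spell out how the domain name, link count, and hash are combined, so the composite sort key below is simply a (domain, link count, hash) tuple, which is an assumption.

```python
import hashlib
from urllib.parse import urlparse

def build_crawl_list(seed_urls, link_counts):
    """Merge seed URLs, drop duplicates, and sort them into a crawl list.

    link_counts is a hypothetical mapping url -> known inbound link count.
    """
    merged = set(seed_urls)                       # eliminates duplicate URL entries

    def sort_key(url):
        domain = urlparse(url).netloc
        h = int(hashlib.md5(url.encode("utf-8")).hexdigest(), 16)
        return (domain, link_counts.get(url, 0), h)

    return sorted(merged, key=sort_key, reverse=True)   # descending (or ascending)

crawl_list = build_crawl_list(
    ["http://news.example.com", "http://blog.example.com", "http://news.example.com"],
    {"http://news.example.com": 120, "http://blog.example.com": 45},
)
```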
S13. Crawl the webpages of the target websites in the crawl list in turn, and obtain the website links in the webpages.
In step S13, the webpage crawling apparatus may crawl the webpages of the target websites in turn according to the order of the URLs of each target website in the crawl list. The apparatus may send requests to a target website and fetch its webpage information, which may include various webpage content such as body text, comments, and website links, and store the webpage information. The apparatus may be configured with multiple threads for fetching to improve fetching efficiency, and for websites under the same domain name a specific crawling strategy may be adopted to avoid the website's anti-crawling countermeasures, such as lowering the crawling frequency, lengthening the crawling cycle, or crawling from multiple machines.
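A sketch of multithreaded fetching with a per-domain delay as one possible way to lower the crawl frequency for a single domain; the thread count and delay value are illustrative assumptions, and only the Python standard library is used.

```python
import threading
import time
import urllib.request
from urllib.parse import urlparse
from concurrent.futures import ThreadPoolExecutor

_last_hit = {}                 # domain -> timestamp of the last request
_lock = threading.Lock()
PER_DOMAIN_DELAY = 2.0         # assumed politeness delay between hits, seconds

def fetch(url):
    domain = urlparse(url).netloc
    while True:                # space out requests to the same domain
        with _lock:
            now = time.time()
            ready = _last_hit.get(domain, 0.0) + PER_DOMAIN_DELAY
            if now >= ready:
                _last_hit[domain] = now
                break
            wait = ready - now
        time.sleep(wait)       # lowered crawl frequency for this domain
    with urllib.request.urlopen(url, timeout=10) as resp:
        return url, resp.read()

urls = ["http://news.example.com/politics", "http://news.example.com/tech"]
with ThreadPoolExecutor(max_workers=8) as pool:   # multithreaded fetching
    pages = dict(pool.map(fetch, urls))
```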
Optionally, a parsing plug-in may also be preset. The parsing plug-in may be implemented with the readabilityBUNDLE algorithm and may be used to denoise the fetched webpage information, that is, to strip the webpage information down and remove invalid (or non-essential) content such as advertisements and the website background, extracting only the useful content such as the title, article, and comments and storing only that content. This greatly lowers the storage space requirement, reduces interference from noise data, and makes the user's later data extraction easier.
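The disclosure names the readabilityBUNDLE algorithm but gives no interface for it, so the sketch below substitutes a much simpler text-density heuristic built on Python's standard html.parser: it keeps the title and prose-like paragraphs, discounts anchor text so link-heavy blocks such as menus and advertisements drop out, and ignores scripts and styles. It illustrates the idea of denoising, not the actual algorithm.

```python
from html.parser import HTMLParser

class MainTextExtractor(HTMLParser):
    """Crude denoiser: keeps <title> and prose-heavy <p> blocks."""
    def __init__(self):
        super().__init__()
        self.title, self.paras = "", []
        self._in = {"title": 0, "p": 0, "a": 0, "script": 0, "style": 0}
        self._buf, self._link_chars = [], 0

    def handle_starttag(self, tag, attrs):
        if tag in self._in:
            self._in[tag] += 1
            if tag == "p":
                self._buf, self._link_chars = [], 0

    def handle_endtag(self, tag):
        if tag in self._in and self._in[tag]:
            if tag == "p":
                text = " ".join(self._buf).strip()
                # keep paragraphs that are mostly prose, not links
                if len(text) > 40 and self._link_chars * 2 <= len(text):
                    self.paras.append(text)
            self._in[tag] -= 1

    def handle_data(self, data):
        if self._in["script"] or self._in["style"]:
            return                      # drop non-essential content
        if self._in["title"]:
            self.title += data.strip()
        elif self._in["p"]:
            self._buf.append(data.strip())
            if self._in["a"]:
                self._link_chars += len(data.strip())

extractor = MainTextExtractor()
extractor.feed("<html><title>News</title><p>" + "Article body text. " * 5 + "</p></html>")
print(extractor.title, extractor.paras)
```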
Optionally, the parsing plug-in parses the webpage information into structured data, and the storage module of the webpage crawling apparatus stores the parsed structured data in a file system. Optionally, if the data from one fetch is too large, it is split across more than one file for storage; for example, the maximum size of each file may be 10 MB (the maximum size can be modified), which facilitates the processing of the data files later.
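A sketch of splitting one fetch's structured output across files with a configurable size cap, using the 10 MB figure from the text; storing each record as a JSON line is an assumption, since the disclosure does not fix a file format.

```python
import json
import os

MAX_BYTES = 10 * 1024 * 1024   # 10 MB cap per file; modifiable, as in the text

def store_records(records, out_dir, prefix="crawl"):
    """Write structured records as JSON lines, rolling to a new file past MAX_BYTES."""
    os.makedirs(out_dir, exist_ok=True)
    index, written, out = 0, 0, None
    try:
        for rec in records:
            data = (json.dumps(rec, ensure_ascii=False) + "\n").encode("utf-8")
            if out is None or written + len(data) > MAX_BYTES:
                if out:
                    out.close()
                out = open(os.path.join(out_dir, f"{prefix}-{index:04d}.jsonl"), "wb")
                index, written = index + 1, 0
            out.write(data)
            written += len(data)
    finally:
        if out:
            out.close()

store_records([{"title": "News", "body": "..."}], "./crawl_output")
```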
The aforementioned parsing plug-in may include a factory-preset generic parsing plug-in, and may also include a custom parsing plug-in obtained after the user performs secondary development on the generic plug-in. For example, if the user has a special need to parse out specific information such as the article, author, publication time, and date, the user can edit the generic parsing plug-in online to obtain a custom parsing plug-in; the webpage crawling apparatus can then load that custom plug-in, parse the webpage information as the user requires, and parse it into the structured data the user needs, thereby crawling the website data precisely according to the user's requirements.
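One way the generic/custom plug-in split could look in code: a base parsing class whose subclass, produced by "secondary development," adds the user-specific fields. This interface is a hypothetical sketch, not the disclosure's actual plug-in API, and the metadata regex is a deliberately naive illustration.

```python
import re

class ParsePlugin:
    """Hypothetical generic parsing plug-in: returns coarse structured data."""
    def parse(self, html: str) -> dict:
        title = re.search(r"<title[^>]*>(.*?)</title>", html, re.S)
        text = re.sub(r"<[^>]+>", " ", html)          # crude tag stripping
        return {"title": title.group(1).strip() if title else "",
                "content": " ".join(text.split())}

class ArticleMetaPlugin(ParsePlugin):
    """Custom plug-in: adds the author / publication-time fields a user asked for."""
    def parse(self, html: str) -> dict:
        data = super().parse(html)
        data["author"] = self._meta(html, "author")
        data["published"] = self._meta(html, "article:published_time")
        return data

    @staticmethod
    def _meta(html: str, name: str):
        m = re.search(
            rf'<meta[^>]+(?:name|property)=["\']{re.escape(name)}["\'][^>]*content=["\']([^"\']+)',
            html)
        return m.group(1) if m else None
```

A crawler would then load whichever plug-in the user selected and call its `parse` method on each fetched page, storing the returned dictionary as the structured record.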
Optionally, when the crawling policy further includes a frequency limiting policy, the webpage crawling apparatus crawls different content in the target website's webpages at different frequencies according to the frequency limiting policy. For example, for a news site, news content may be crawled very frequently (such as once an hour), while comment content may be crawled once a day. This improves crawling efficiency on the one hand, and reduces useless noise data and lowers the storage space requirement on the other.
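A sketch of the frequency limiting policy as a simple due-time check; the content-type labels and the one-hour/one-day intervals mirror the news/comments example above, but the mechanism itself is an assumption.

```python
import time

FREQUENCY_LIMITS = {"news": 3600, "comments": 86400}   # seconds between crawls

_next_due = {}   # (url, content_type) -> earliest next crawl time

def due_for_crawl(url, content_type, now=None):
    now = time.time() if now is None else now
    if now >= _next_due.get((url, content_type), 0.0):
        _next_due[(url, content_type)] = now + FREQUENCY_LIMITS[content_type]
        return True
    return False

due_for_crawl("http://news.example.com", "news")       # True; next crawl in one hour
due_for_crawl("http://news.example.com", "comments")   # True; next crawl in one day
```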
Optionally, when the crawling policy further includes a quantity limiting policy, the webpage crawling apparatus crawls a preset quantity of specified content in the target website's webpages according to the quantity limiting policy. For example, for comment content, it is possible to crawl only a preset number of comments, or only the comments on a preset number of pages (such as the first few pages). This improves crawling efficiency on the one hand, and reduces useless noise data and lowers the storage space requirement on the other.
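The quantity limiting policy amounts to truncation at fetch time; a short sketch, with the per-type caps as assumed values.

```python
QUANTITY_LIMITS = {"comments": 50, "comment_pages": 3}  # assumed caps

def limit_comments(comment_pages):
    """comment_pages: iterable of comment lists, one list per page."""
    kept = []
    for page_no, page in enumerate(comment_pages, start=1):
        if page_no > QUANTITY_LIMITS["comment_pages"]:
            break                      # only the first few pages
        kept.extend(page)
        if len(kept) >= QUANTITY_LIMITS["comments"]:
            return kept[:QUANTITY_LIMITS["comments"]]  # only a preset number
    return kept

# with a lazy generator as input, pages beyond the cap are never even fetched
limit_comments([["c1", "c2"], ["c3"], ["c4"], ["c5"]])
```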
S14. Filter the website links according to the URL restriction policy, and add the remaining website links after filtering to the crawl list as links of the target website for subsequent crawling.
In step S14, the webpage crawling apparatus may filter the website links in the currently crawled webpage according to the configured URL restriction policy, filter out the invalid links among them, and add only the remaining website links after filtering to the crawl list as links of the target website, to await subsequent crawling.
For example, for some URLs the restriction policy is to crawl only once. Accordingly, for any one of those URLs, once the apparatus has crawled it, the URL is filtered out and not crawled again.
As another example, for some URLs the restriction policy is to crawl once every preset interval. Accordingly, for any one of those URLs, once the apparatus has crawled it, the URL is filtered out for the preset duration; that is, it is not crawled again for a period of time.
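A sketch of the S14 filter covering both policy forms, crawl-once URLs and crawl-every-interval URLs; the assumption that URLs not named in either policy always pass through is the author's example convention here, not stated in the disclosure.

```python
import time

class UrlFilter:
    def __init__(self, crawl_once, recrawl_interval_s):
        self.crawl_once = set(crawl_once)                   # crawl only once
        self.recrawl_interval_s = dict(recrawl_interval_s)  # crawl once per interval
        self._seen = set()
        self._last = {}

    def allow(self, url, now=None):
        now = time.time() if now is None else now
        if url in self.crawl_once:
            if url in self._seen:
                return False            # already crawled once: filter out
            self._seen.add(url)
            return True
        interval = self.recrawl_interval_s.get(url)
        if interval is not None:
            if now - self._last.get(url, float("-inf")) < interval:
                return False            # within the preset duration: filter out
            self._last[url] = now
        return True

f = UrlFilter({"http://a.example/landing"}, {"http://a.example/feed": 3600})
links = ["http://a.example/landing", "http://a.example/feed", "http://a.example/new"]
crawl_list_additions = [u for u in links if f.allow(u)]
```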
Optionally, the webpage crawling apparatus may also monitor the crawling tasks, for example monitoring a task's running status, including whether it is currently running, the time of the last successful run, the duration of the last successful run, the time of the last failed run, and so on, so that the user can view and manage tasks in real time.
Optionally, the webpage crawling apparatus may also manage the crawling tasks, including adding a task, deleting a task, starting a task, stopping a task, starting a task immediately, and viewing task information, so that the user can manage the crawling tasks in real time.
Thus, by controlling the crawled external links, the webpage crawling method of this embodiment of the present disclosure effectively filters out irrelevant websites, reduces the amount of website data crawled, and targets the crawling of useful information to a greater extent, which improves crawling efficiency, reduces useless noise data and thereby lowers the storage space requirement, and greatly reduces bandwidth usage.
Embodiment 2
Referring to FIG. 2, a webpage crawling apparatus according to a second embodiment of the present disclosure is proposed. The apparatus includes a configuration module 10, a webpage crawling module 20, and a link filtering module 30, wherein:
Configuration module 10: configured to configure the crawling task and the crawling policy.
In this embodiment, the configuration module 10 may be configured to receive the user's configuration operations and configure the crawling task and the crawling policy.
The crawling task includes at least the target website; that is, the configuration module 10 may be configured to receive the user's setting of the website entry points to be crawled and configure the target websites to be crawled according to that setting. In addition, the crawling task may further include at least one of a daily task start/stop time, a task crawl depth, and a daily task cycle count and cycle interval; that is, the user may also configure parameter information such as the daily task start time, the daily task stop time, the task crawl depth, the number of task cycles per day, and the task cycle interval.
The crawling policy includes at least a URL restriction policy. The URL restriction policy may include specifying that a URL is crawled only once, or that a URL is crawled once every preset interval, or that one URL (referred to here as the first URL) is crawled only once while another URL (referred to here as the second URL) is crawled once every preset interval. For example, some URLs need not be crawled again after being crawled once, while other URLs, once crawled, are not crawled again for a period of time. Optionally, the crawling policy may further include a frequency limiting policy, a quantity limiting policy, and the like, where the frequency limiting policy sets different crawling frequencies for different content in a webpage, and the quantity limiting policy crawls only a preset quantity of specified content in a webpage.
Webpage crawling module 20: configured to generate a crawl list according to the target website, crawl the webpages of the target websites in the crawl list in turn, and obtain the website links in the webpages.
As shown in FIG. 3, the webpage crawling module 20 may include a generating unit 201 and a fetching unit 202; the generating unit 201 is configured to generate the crawl list according to the target website, and the fetching unit 202 is configured to fetch the webpage information of the target website.
The generating unit 201 may be configured to read the target websites configured by the user, merge the URLs of the target websites, and eliminate duplicate URL entries, and then to sort the merged URLs, for example in descending (or ascending) order computed from the domain name, the link count, and a hash algorithm together, to generate the crawl list.
The fetching unit 202 may be configured to crawl the webpages of the target websites in turn according to the order of the URLs of each target website in the crawl list. Optionally, the fetching unit 202 may be configured to send requests to the target website and fetch its webpage information, which includes various webpage content such as body text, comments, and website links, and to store the webpage information. The fetching unit 202 may be configured to use multiple threads for fetching to improve fetching efficiency; for websites under the same domain name, a specific crawling strategy may be adopted to avoid the website's anti-crawling countermeasures, such as lowering the crawling frequency, lengthening the crawling cycle, or crawling from multiple machines.
Optionally, the webpage crawling module 20 further includes a parsing unit 203 configured to denoise the webpage information according to the preset parsing plug-in, and to extract and store the useful content in the webpage information. Optionally, the parsing plug-in parses the webpage information into structured data.
The parsing plug-in may be implemented with the readabilityBUNDLE algorithm, and the parsing unit 203 may be configured to load the parsing plug-in and then use it to denoise the fetched webpage information, so as to strip the webpage information down and remove invalid (or non-essential) content such as advertisements and the website background, extracting only the useful content such as the title, article, and comments and storing only that content; this greatly lowers the storage space requirement, reduces interference from noise data, and makes the user's later data extraction easier.
Optionally, when the crawling policy further includes a frequency limiting policy, the webpage crawling module 20 is further configured to crawl different content in the target website's webpages at different frequencies according to the frequency limiting policy. For example, for a news site, news content may be crawled very frequently (such as once an hour), while comment content may be crawled once a day. This improves crawling efficiency on the one hand, and reduces useless noise data and lowers the storage space requirement on the other.
Optionally, when the crawling policy further includes a quantity limiting policy, the webpage crawling module 20 is further configured to crawl a preset quantity of specified content in the target website's webpages according to the quantity limiting policy. For example, for comment content, it is possible to crawl only a preset number of comments, or only the comments on a preset number of pages (such as the first few pages). This improves crawling efficiency on the one hand, and reduces useless noise data and lowers the storage space requirement on the other.
Link filtering module 30: configured to filter the website links according to the URL restriction policy to filter out the invalid links among them, and to add the remaining website links after filtering to the crawl list as links of the target website for subsequent crawling by the webpage crawling module 20.
The link filtering module 30 may be configured to filter the website links in the currently crawled webpage according to the configured URL restriction policy, filter out the invalid links among them, add only the remaining website links after filtering to the crawl list as links of the target website, and update the crawl list so that the webpage crawling module 20 subsequently crawls the newly added website links.
For example, for some URLs the restriction policy is to crawl only once. Accordingly, for any one of those URLs, once the webpage crawling module 20 has crawled it, the link filtering module 30 filters it out so that the webpage crawling module 20 does not crawl it again.
As another example, for some URLs the restriction policy is to crawl once every preset interval. Accordingly, for any one of those URLs, once the webpage crawling module 20 has crawled it, the link filtering module 30 filters it out for the preset duration; that is, the webpage crawling module 20 does not crawl it again for a period of time.
Optionally, the webpage crawling apparatus may further include a storage module configured to store the parsed structured data in a file system. Optionally, if the data from one fetch is too large, it is split across more than one file for storage; for example, the maximum size of each file may be 10 MB (the maximum size can be modified), which facilitates the processing of the data files later.
In some embodiments, the aforementioned parsing plug-in may include a factory-preset generic parsing plug-in.
In an optional embodiment, the apparatus may further include a plug-in development module configured to receive the user's instruction for secondary development of the generic parsing plug-in and generate a custom parsing plug-in.
For example, if the user has a special need to parse out specific information such as the article, author, publication time, and date, the user can edit the generic parsing plug-in online through the plug-in development module to obtain a custom parsing plug-in; the webpage crawling module 20 can then load the custom parsing plug-in, parse the webpage information as the user requires, and parse it into the structured data the user needs, thereby crawling the website data precisely according to the user's requirements.
Thus, by controlling the crawled external links, the webpage crawling apparatus of this embodiment of the present disclosure effectively filters out irrelevant websites, reduces the amount of website data crawled, and targets the crawling of useful information to a greater extent, which improves crawling efficiency, reduces useless noise data and thereby lowers the storage space requirement, and greatly reduces bandwidth usage.
Embodiment 3
Referring to FIG. 4, a webpage crawling apparatus according to a third embodiment of the present disclosure is proposed. The apparatus includes a graphical user interface module 100, a basic support module 200, a plug-in development module 300, a crawling module 400, and a storage module 500, wherein:
Basic support module 200: configured to provide the basic services for webpage crawling, including various configuration, management, and monitoring services. The basic support module 200 interacts with the user; the user can operate on tasks interactively, and the system supports running multiple tasks at the same time. The entire system is managed through this module, which receives the user-configured target seeds (such as the target websites) and the various crawling policies, and saves the received user configuration in a configuration file for subsequent crawling.
The basic support module 200 may include the configuration module 10 and a supervision module. The configuration module 10 is the same as the configuration module 10 in the second embodiment and is not described again here. The supervision module is configured to monitor and manage the crawling tasks, wherein: for task monitoring, it monitors a task's running status, including whether the task is currently running, the time of the last successful run, the duration of the last successful run, the time of the last failed run, and so on, so that the user can view and manage tasks in real time; task management includes adding tasks, deleting tasks, starting tasks, stopping tasks, starting a task immediately, viewing task information, and other operations, so that the user can manage the crawling tasks in real time.
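The supervision module's per-task view could be as simple as one status record per task; a minimal sketch with hypothetical field names covering exactly the monitored items listed above.

```python
import time
from dataclasses import dataclass
from typing import Optional

@dataclass
class TaskStatus:
    name: str
    running: bool = False                            # currently running?
    last_success_at: Optional[float] = None          # time of last successful run
    last_success_duration_s: Optional[float] = None  # duration of last successful run
    last_failure_at: Optional[float] = None          # time of last failed run

    def record_run(self, ok: bool, started: float):
        now = time.time()
        if ok:
            self.last_success_at = now
            self.last_success_duration_s = now - started
        else:
            self.last_failure_at = now
```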
Graphical user interface module 100: configured to provide a graphical display interface for the user, making graphical operation convenient, including the graphical display and operation of crawling task configuration, crawling policy configuration, task monitoring, task management, and plug-in development; interactive user operation greatly enhances ease of use.
Plug-in development module 300: configured to receive the user's instruction for secondary development of the generic parsing plug-in and generate a custom parsing plug-in. The user can develop user-specific parsing plug-ins on the graphical interface as needed. The plug-in development module 300 in this embodiment is the same as the plug-in development module in the second embodiment and is not described again here.
Crawling module 400: configured to generate a crawl list according to the target website, crawl the webpages of the target websites in the crawl list in turn, and obtain the website links in the webpages; and to filter the website links according to the URL restriction policy to filter out the invalid links among them, and add the remaining website links after filtering to the crawl list as links of the target website for subsequent crawling. The crawling module in this embodiment is equivalent to the combination of the webpage crawling module 20 and the link filtering module 30 in the second embodiment; see the webpage crawling module 20 and the link filtering module 30 in the second embodiment, which are not described again here.
Storage module 500: configured to store the webpage information crawled by the crawling module. When the crawling module has parsed the webpage information, the parsed structured data is stored in a file system. Optionally, if the data from one fetch is too large, it is split across more than one file for storage; for example, the maximum size of each file may be 10 MB (the maximum size can be modified), which facilitates the processing of the data files later.
As shown in FIG. 5, webpage crawling with the apparatus of this embodiment may include the following flow:
Step 101: When the user performs operations such as crawling task configuration, crawling policy configuration, and task management, the graphical user interface module issues the operation command to the basic support module, and the basic support module parses the operation command and handles it accordingly.
Step 102: After handling the user's operation command, the basic support module returns the operation result to the user and saves the information, such as the configuration and other operation information.
Step 103: When the user develops and edits a plug-in online, the graphical user interface sends the operation command to the plug-in development module, and the plug-in development module parses the operation command and handles it accordingly.
Step 104: The plug-in development module turns the parsing plug-in developed by the user into a custom parsing plug-in for later use in parsing webpages, saves the information, and returns the operation result to the graphical user interface for display to the user.
Step 105: The user issues a start-task-immediately command to the crawling module through the graphical user interface module, and the crawling module reacts accordingly.
Step 106: When the configured task start time arrives, the crawling module reacts accordingly.
Step 107: Upon receiving a start-task-immediately command, or when the task start time arrives, the crawling module starts the crawling task, crawls the webpages, parses them, and adds the filtered external links to the pool of webpages to be crawled (such as the crawl list).
Step 108: After crawling is complete, the crawling module issues a storage command to the storage module to notify it to store the data.
Step 109: Upon receiving the storage command, the storage module stores the structured webpage data in files, split into files according to the data size.
Step 110: After storage is complete, the storage module returns the crawl result to the graphical user interface, so as to inform the user through the graphical user interface that all operations are complete, and updates the task status.
In the webpage crawling apparatus of the embodiments of the present disclosure, a URL restriction policy is configured, the website links in the crawled webpages are filtered according to the URL restriction policy to filter out the invalid links among them, and the remaining website links after filtering are added to the crawl list as links of the target website for subsequent crawling. Irrelevant websites are thus filtered out effectively, the amount of website data crawled is reduced, and crawling is targeted at useful information to a greater extent, which improves crawling efficiency, reduces useless noise data and thereby lowers the storage space requirement, and greatly reduces bandwidth usage.
At the same time, the parsing plug-in is used to denoise the fetched webpage information and to extract and store the useful content in it, which greatly lowers the storage space requirement, reduces interference from noise data, and makes the user's later data extraction easier. Moreover, the user is allowed to perform secondary development on the generic parsing plug-in to generate a custom parsing plug-in and to parse the webpage information with it, which enables precise crawling of website data and satisfies the user's individual needs.
The webpage crawling apparatus of the embodiments of the present disclosure can be deployed either on a single machine or in a Hadoop cluster.
An embodiment of the present disclosure further provides a computer-readable storage medium storing computer-executable instructions which, when executed, implement the webpage crawling method described above.
From the description of the above embodiments, a person skilled in the art can clearly understand that the methods of the above embodiments can be implemented by software together with a general-purpose hardware platform that cooperates with the software, or of course by hardware. Based on this understanding, the technical solution of the present disclosure, in essence or in the part that contributes to the art, can be embodied in the form of a software product; the computer software product is stored in a storage medium (such as a ROM/RAM, a magnetic disk, or an optical disc) and includes a number of instructions for causing a terminal device (which may be a mobile phone, a computer, a server, a network device, or the like) to execute the methods described in the different embodiments of the present disclosure.
A person of ordinary skill in the art can understand that all or some of the steps of the methods disclosed above, and the functional modules/units of the systems and apparatuses, may be implemented as software, firmware, hardware, or suitable combinations thereof. In a hardware implementation, the division between the functional modules/units mentioned in the above description does not necessarily correspond to the division of physical components; for example, one physical component may have multiple functions, or one function or step may be performed by several physical components in cooperation. Some or all of the components may be implemented as software executed by a processor such as a digital signal processor or a microprocessor, or as hardware, or as an integrated circuit such as an application-specific integrated circuit. Such software may be distributed on computer-readable media, which may include computer storage media (or non-transitory media) and communication media (or transitory media). As is well known to a person of ordinary skill in the art, the term computer storage media includes volatile and nonvolatile, removable and non-removable media implemented in any method or technology for storing information (such as computer-readable instructions, data structures, program modules, or other data). Computer storage media include, but are not limited to, Random Access Memory (RAM), Read-Only Memory (ROM), Electrically Erasable Programmable Read-Only Memory (EEPROM), flash memory or other memory technology, Compact Disc Read-Only Memory (CD-ROM), Digital Versatile Disc (DVD) or other optical disc storage, magnetic cassettes, magnetic tape, magnetic disk storage or other magnetic storage devices, or any other medium that can be used to store the desired information and that can be accessed by a computer. Moreover, as is well known to a person of ordinary skill in the art, communication media typically contain computer-readable instructions, data structures, program modules, or other data in a modulated data signal such as a carrier wave or other transport mechanism, and may include any information delivery media.
A person of ordinary skill in the art can understand that the technical solutions of the present disclosure may be modified or substituted with equivalents without departing from the spirit and scope of the technical solutions of the present disclosure, and all such modifications and substitutions shall fall within the scope of the claims of the present disclosure.
Industrial Applicability
In the webpage crawling method of the embodiments of the present disclosure, a URL restriction policy is configured, the website links in the crawled webpages are filtered according to the URL restriction policy to filter out the invalid links among them, and the remaining website links after filtering are added to the crawl list as links of the target website for subsequent crawling. Irrelevant websites are thus filtered out effectively, the amount of website data crawled is reduced, and crawling is targeted at useful information to a greater extent, which improves crawling efficiency, reduces useless noise data and thereby lowers the storage space requirement, and greatly reduces bandwidth usage.

Claims (14)

  1. A webpage crawling method, comprising the following steps:
    configuring a crawling task and a crawling policy, the crawling task comprising a target website and the crawling policy comprising a URL restriction policy;
    generating a crawl list according to the target website;
    crawling the webpages of the target websites in the crawl list in turn, and obtaining the website links in the webpages; and
    filtering the website links according to the URL restriction policy to filter out the invalid links among them, and adding the remaining website links after filtering to the crawl list as links of the target website for subsequent crawling.
  2. The webpage crawling method according to claim 1, wherein the URL restriction policy comprises specifying that a URL is crawled only once, or that a URL is crawled once every preset interval, or that a first URL is crawled only once and a second URL is crawled once every preset interval.
  3. The webpage crawling method according to claim 1, wherein
    the crawling policy further comprises a frequency limiting policy, and crawling the webpages of the target websites in the crawl list in turn comprises:
    crawling different content in the target website's webpages at different frequencies according to the frequency limiting policy.
  4. The webpage crawling method according to claim 1, wherein
    the crawling policy further comprises a quantity limiting policy, and crawling the webpages of the target websites in the crawl list in turn comprises:
    crawling a preset quantity of specified content in the target website's webpages according to the quantity limiting policy.
  5. The webpage crawling method according to claim 1, wherein the crawling task further comprises at least one of a daily task start/stop time, a task crawl depth, and a daily task cycle count and cycle interval.
  6. The webpage crawling method according to any one of claims 1 to 5, wherein crawling the webpages of the target websites in the crawl list in turn comprises:
    fetching the webpage information of the target website; and
    denoising the webpage information according to a preset parsing plug-in, and extracting and storing the useful content in the webpage information.
  7. The webpage crawling method according to claim 6, wherein the parsing plug-in comprises a generic parsing plug-in or a custom parsing plug-in obtained after secondary development of the generic parsing plug-in by the user.
  8. A webpage crawling apparatus, comprising:
    a configuration module configured to configure a crawling task and a crawling policy, the crawling task comprising a target website and the crawling policy comprising a URL restriction policy;
    a webpage crawling module configured to generate a crawl list according to the target website, crawl the webpages of the target websites in the crawl list in turn, and obtain the website links in the webpages; and
    a link filtering module configured to filter the website links according to the URL restriction policy to filter out the invalid links among them, and to add the remaining website links after filtering to the crawl list as links of the target website for subsequent crawling by the webpage crawling module.
  9. The webpage crawling apparatus according to claim 8, wherein the URL restriction policy comprises specifying that a URL is crawled only once, or that a URL is crawled once every preset interval, or that a first URL is crawled only once and a second URL is crawled once every preset interval.
  10. The webpage crawling apparatus according to claim 8, wherein
    the crawling policy further comprises a frequency limiting policy, and the webpage crawling module is further configured to crawl different content in the target website's webpages at different frequencies according to the frequency limiting policy.
  11. The webpage crawling apparatus according to claim 8, wherein
    the crawling policy further comprises a quantity limiting policy, and the webpage crawling module is further configured to crawl a preset quantity of specified content in the target website's webpages according to the quantity limiting policy.
  12. The webpage crawling apparatus according to claim 8, wherein the crawling task further comprises at least one of a daily task start/stop time, a task crawl depth, and a daily task cycle count and cycle interval.
  13. The webpage crawling apparatus according to any one of claims 8 to 12, wherein the webpage crawling module comprises:
    a fetching unit configured to fetch the webpage information of the target website; and
    a parsing unit configured to denoise the webpage information according to a preset parsing plug-in, and to extract and store the useful content in the webpage information.
  14. The webpage crawling apparatus according to claim 13, wherein
    the apparatus further comprises a plug-in development module configured to receive the user's instruction for secondary development of the generic parsing plug-in and generate a custom parsing plug-in.
PCT/CN2018/074262 2017-03-01 2018-01-26 Webpage crawling method and apparatus WO2018157686A1 (zh)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN201710117896.5A CN108536691A (zh) 2017-03-01 2017-03-01 Webpage crawling method and apparatus
CN201710117896.5 2017-03-01

Publications (1)

Publication Number Publication Date
WO2018157686A1 (zh) 2018-09-07

Family

ID=63370576

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2018/074262 WO2018157686A1 (zh) 2018-01-26 Webpage crawling method and apparatus

Country Status (2)

Country Link
CN (1) CN108536691A (zh)
WO (1) WO2018157686A1 (zh)

Families Citing this family (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109614536A (zh) * 2018-11-30 2019-04-12 平安科技(深圳)有限公司 基于YouTuBe的视频批量爬取方法、系统、装置及可存储介质
CN109902212A (zh) * 2019-01-25 2019-06-18 中国电子科技集团公司第三十研究所 一种自定义动态扩展的暗网爬虫系统
CN112905867B (zh) * 2019-03-14 2022-06-07 福建省天奕网络科技有限公司 一种高效率的历史数据追溯爬取方法及终端
CN112579859A (zh) * 2019-09-30 2021-03-30 北京国双科技有限公司 无效流量的处理方法及装置、存储介质和设备
CN112417240A (zh) * 2020-02-21 2021-02-26 上海哔哩哔哩科技有限公司 网站链接检测方法、装置、计算机设备
CN113965371B (zh) * 2021-10-19 2023-08-29 北京天融信网络安全技术有限公司 网站监测过程中的任务处理方法、装置、终端及存储介质

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2012142092A1 (en) * 2011-04-11 2012-10-18 Vistaprint Technologies Limited Configurable web crawler
CN104063448A (zh) * 2014-06-18 2014-09-24 华东师范大学 一种视频领域相关的分布式微博数据抓取系统
CN104182412A (zh) * 2013-05-24 2014-12-03 中国移动通信集团安徽有限公司 一种网页爬取方法及系统

Family Cites Families (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102184227B (zh) * 2011-05-10 2013-05-08 北京邮电大学 一种面向web服务的通用爬虫引擎系统及其工作方法
CN102880607A (zh) * 2011-07-15 2013-01-16 舆情(香港)有限公司 网络动态内容抓取方法及网络动态内容爬虫系统
CN103440139A (zh) * 2013-09-11 2013-12-11 北京邮电大学 一种面向主流微博网站微博id的采集方法及工具
CN103902684B (zh) * 2014-03-25 2018-02-23 浪潮电子信息产业股份有限公司 一种爬虫采集内容结构化的方法
US20160055243A1 (en) * 2014-08-22 2016-02-25 Ut Battelle, Llc Web crawler for acquiring content
CN105956175B (zh) * 2016-05-24 2017-09-05 考拉征信服务有限公司 网页内容爬取的方法和装置

Patent Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2012142092A1 (en) * 2011-04-11 2012-10-18 Vistaprint Technologies Limited Configurable web crawler
CN104182412A (zh) * 2013-05-24 2014-12-03 中国移动通信集团安徽有限公司 一种网页爬取方法及系统
CN104063448A (zh) * 2014-06-18 2014-09-24 华东师范大学 一种视频领域相关的分布式微博数据抓取系统

Also Published As

Publication number Publication date
CN108536691A (zh) 2018-09-14 Webpage crawling method and apparatus

Similar Documents

Publication Publication Date Title
WO2018157686A1 (zh) 2018-09-07 Webpage crawling method and apparatus
JP6488508B2 (ja) 2019-03-27 Web page access method, apparatus, device, and program
CN106294351A (zh) 2017-01-04 Log event processing method and apparatus
CN106126693B (zh) 2019-07-16 Method and apparatus for sending data related to a webpage
CN103531218B (zh) 2017-03-15 Online multimedia file editing method and system
CN108696488B (zh) 2022-04-15 Upload interface identification method, identification server, and system
CN106656920B (zh) 2020-07-14 Method, apparatus, storage medium, and processor for processing HTTP services
CN102314469A (zh) 2012-01-11 Method for implementing cross-domain request callbacks
CN107391775A (zh) 2017-11-24 Method and system for implementing a universal web crawler model
WO2019201040A1 (zh) 2019-10-24 Method, system, and terminal device for managing update files
CN108920691B (zh) 2023-06-20 Method, apparatus, computer device, and storage medium for managing front-end static resources
WO2019153603A1 (zh) 2019-08-15 Configuration method for webpage crawling, application server, and computer-readable storage medium
CN110727890A (zh) 2020-01-24 Page loading method and apparatus, computer device, and storage medium
CN103475688A (zh) 2013-12-25 Distributed method and system for downloading website data
CN106599270B (zh) 2020-05-19 Network data crawling method and crawler
CN110390043A (zh) 2019-10-29 Method, apparatus, terminal, and storage medium for crawling webmail data
US9942267B1 (en) 2018-04-10 Endpoint segregation to prevent scripting attacks
CN106649357A (zh) 2017-05-10 Data processing method and apparatus for crawler programs
CN111125485A (zh) 2020-05-08 Scrapy-based website URL crawling method
WO2016029384A1 (zh) 2016-03-03 Resource downloading method, electronic device, and apparatus
CN114095755A (zh) 2022-02-25 Video processing method, apparatus, and system, electronic device, and storage medium
CN103354546A (zh) 2013-10-16 Packet filtering method and apparatus
CN110147473B (zh) 2021-04-27 Crawler crawling method and apparatus
EP3502925B1 (en) 2021-04-14 Computer system and method for extracting dynamic content from websites
CN106209992A (zh) 2016-12-07 Method for a router to support RSS subscription task downloads, and router

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 18761321

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 18761321

Country of ref document: EP

Kind code of ref document: A1