CN104391917A - Method for incrementally capturing webpage contents

Info

Publication number: CN104391917A
Application number: CN201410663266.4A
Authority: CN
Inventors: 康钟荣; 李强
Original assignee: Sichuan Changhong Electric Co Ltd
Current assignee: Sichuan Changhong Electric Co Ltd
Priority date: 2014-11-19
Filing date: 2014-11-19
Publication date: 2015-03-04

Abstract

The invention discloses a method for incrementally capturing webpage content. The method first establishes a basic resource library of webpage content by global search traversal. It then extracts the URLs of webpages whose content is still changing, adds these URLs and the corresponding webpage parsing rules to a distributed task queue, downloads the webpages, and creates a DOM object for each. Content metadata is extracted according to the parsing rules and compared with the content already in the basic library: updated information replaces the stored information, and content under URLs that do not yet exist in the basic resource library is inserted. When checking the whole resource library for updates, the method does not need to traverse the content of every URL; it only traverses the content under URLs marked as changing. This realizes incremental capture of webpage content, reduces the capture workload, and improves update efficiency.

Description

A method for incrementally capturing webpage content

Technical field

The present invention relates to the field of computer technology, and in particular to a method for incrementally capturing webpage content.

Background art

At present, crawler technology is widely used to capture webpage content. Massive amounts of web information are usually obtained with a greedy, blanket-style crawling method and used to build a corresponding resource library. In an era of rapid internet growth, however, web information is also updated very quickly. Under requirements such as timeliness and freshness, the value of an existing library of webpage content therefore declines, and it can no longer meet demand. Existing incremental capture methods often still traverse all webpages and extract from them the information that is fresher than the existing resources. This approach is very inefficient and places high demands on time, storage space, and network bandwidth. An efficient method for incrementally capturing webpage content is therefore particularly important.

Summary of the invention

To overcome the above shortcomings of the prior art, the invention provides a method for incrementally capturing webpage content that completes the capture work efficiently and with a lower workload.

The technical solution adopted by the invention to solve this technical problem is a method for incrementally capturing webpage content, comprising the following steps:

Step 1: establish a basic resource library of webpage content.

Step 2: identify and filter out, from the basic resource library, the webpages that are in a continuously changing state, store their URLs in a database table, and define webpage parsing rules.

Step 3: deploy a number of threads that run tasks concurrently to incrementally capture webpage content metadata.

Step 4: incrementally update the basic resource library.

Compared with the prior art, the invention has the following positive effects. The invention first establishes a basic resource library of webpage content by global search traversal. It then extracts the URLs of webpages whose content is still changing and adds these URLs, together with the corresponding webpage parsing rules, to a distributed task queue. The webpages are downloaded and a DOM object is created for each. Content metadata is extracted according to the parsing rules and compared with the content in the original basic library; updated information replaces the stored information, and content under URLs that do not exist at all in the basic resource library is inserted. When checking the whole resource library for updates, the invention does not need to traverse the content of every URL. It only traverses the content under URLs that carry a change marker, which realizes incremental capture of webpage content, reduces the capture workload, and improves update efficiency.

Brief description of the drawings

The invention is described by way of example with reference to the accompanying drawing, in which:

Fig. 1 is a flow chart of the method of the invention.

Detailed description

In the method of the invention for incrementally capturing webpage content, the URLs of webpages whose content is still changing are extracted from an already established resource library of webpage content. Content metadata is then captured from the downloaded webpages and used either to incrementally update the corresponding content in the resource library or to insert webpage content that did not exist in the original resource library.

In the invention, the incremental update of the basic resource library covers both the updating of existing webpage content information and the addition of webpage content information under entirely new URLs.

As shown in Fig. 1, the method of the invention comprises the following steps:

Step S101: establish a basic resource library of webpage content.

Starting from a seed URL, all webpages that meet the defined requirements are traversed by global search traversal. These webpages are downloaded and parsed according to parsing rules that match their characteristics, and the webpage content metadata is extracted. The metadata is then inserted into a preset relational database, which establishes the basic resource library of webpage content. Each specific webpage corresponds to a uniquely identifying ID.
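As an illustration of step S101, the following is a minimal sketch and not the patented implementation. It assumes a breadth-first traversal, an in-memory SQLite database in place of the preset relational database, and a hypothetical `fetch_and_parse` helper backed by a small fake web (`FAKE_WEB`) instead of real downloads. The table name `base_library` and its columns are also assumptions.

```python
import sqlite3
from collections import deque

# Hypothetical stand-in for the web: URL -> (metadata, outgoing links).
FAKE_WEB = {
    "http://example.com/": ({"title": "Home"},
                            ["http://example.com/a", "http://example.com/b"]),
    "http://example.com/a": ({"title": "Series A"}, ["http://example.com/b"]),
    "http://example.com/b": ({"title": "Film B"}, ["http://example.com/"]),
}

def fetch_and_parse(url):
    """Placeholder for download plus rule-based parsing (assumed helper)."""
    return FAKE_WEB.get(url, ({}, []))

def build_base_library(conn, seed_url):
    """Traverse breadth-first from the seed and store one row per webpage."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS base_library ("
        "id INTEGER PRIMARY KEY AUTOINCREMENT, "  # unique ID per webpage
        "url TEXT UNIQUE, title TEXT)"
    )
    seen, pending = {seed_url}, deque([seed_url])
    while pending:
        url = pending.popleft()
        metadata, links = fetch_and_parse(url)
        conn.execute(
            "INSERT INTO base_library (url, title) VALUES (?, ?)",
            (url, metadata.get("title")),
        )
        for link in links:
            if link not in seen:  # visit each URL only once
                seen.add(link)
                pending.append(link)
    conn.commit()

conn = sqlite3.connect(":memory:")
build_base_library(conn, "http://example.com/")
rows = conn.execute(
    "SELECT id, url, title FROM base_library ORDER BY id"
).fetchall()
```

The `seen` set keeps the traversal from looping when pages link back to each other, and the auto-incremented `id` column plays the role of the unique webpage ID.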

Step S102: identify and filter out, from the basic resource library, the webpages that are in a continuously changing state, and define parsing rules that match the webpage characteristics and requirements.

According to the characteristics and related markers of the webpage content in the basic resource library, the URLs of webpages whose content is still changing are filtered out and stored in a corresponding database table. For each webpage, it must be determined whether its content is still changing and whether it needs follow-up tracking and incremental capture. If so, the URL of the webpage is added to the corresponding database table, which is used only to store the URLs of webpages that need updating. For example, for video-type webpage content, what usually changes is the playback address of the video, as with the episodes of a serialized TV drama.
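The filtering described above could be sketched as follows. The text does not specify how a changing page is marked, so this example assumes a hypothetical `status` column in which `'ongoing'` flags content that is still being updated. The table names `base_library` and `update_urls` are likewise assumptions.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE base_library (
    id INTEGER PRIMARY KEY, url TEXT UNIQUE, title TEXT, status TEXT);
-- Holds only the URLs of webpages that still need updating.
CREATE TABLE update_urls (url TEXT PRIMARY KEY);
INSERT INTO base_library (url, title, status) VALUES
    ('http://example.com/series/1', 'Serialized drama', 'ongoing'),
    ('http://example.com/film/2',   'Feature film',     'finished'),
    ('http://example.com/series/3', 'Weekly show',      'ongoing');
""")

def select_changing_urls(conn):
    """Copy the URLs flagged as still changing into the update table."""
    conn.execute(
        "INSERT OR IGNORE INTO update_urls (url) "
        "SELECT url FROM base_library WHERE status = 'ongoing'"
    )
    conn.commit()
    return [row[0] for row in
            conn.execute("SELECT url FROM update_urls ORDER BY url")]

changing = select_changing_urls(conn)
```

`INSERT OR IGNORE` makes the selection safe to rerun: a URL already present in `update_urls` is not added twice.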

The URLs are then taken from the database table that stores the addresses of webpages with changing content, and metadata parsing rules that match the characteristics of the webpage content are defined. A webpage parsing rule is defined for a particular type of webpage, according to the page format, the structural design characteristics, the importance of the content information, user requirements, and so on.
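One possible way to represent such parsing rules is shown below. This is only a sketch under stated assumptions: the `ParseRule` class, the use of path expressions to locate fields, and the choice of a rule by URL substring are all hypothetical, since the text does not fix a rule format.

```python
from dataclasses import dataclass
from typing import Dict, Optional

@dataclass
class ParseRule:
    """Hypothetical rule: a page type plus one path expression per field."""
    page_type: str
    field_paths: Dict[str, str]

VIDEO_RULE = ParseRule(
    page_type="video",
    field_paths={
        "title": ".//h1[@class='title']",
        "actors": ".//span[@class='actor']",
    },
)
NEWS_RULE = ParseRule(
    page_type="news",
    field_paths={"title": ".//h1", "published": ".//time"},
)

# Assumed convention: the URL path reveals the webpage type.
RULES_BY_MARKER = {"/series/": VIDEO_RULE, "/news/": NEWS_RULE}

def rule_for(url: str) -> Optional[ParseRule]:
    """Return the first rule whose marker occurs in the URL, else None."""
    for marker, rule in RULES_BY_MARKER.items():
        if marker in url:
            return rule
    return None

# Pair each URL that needs updating with its parsing rule.
tasks = [(url, rule_for(url)) for url in
         ["http://example.com/series/1", "http://example.com/news/7"]]
```

Keeping the rule as plain data means the same extraction code can serve every webpage type, and only the rule changes.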

Step S103: deploy a number of threads that run tasks concurrently to incrementally capture webpage content metadata.

A number of threads is deployed according to the scale of the webpages to be captured, so that the incremental capture runs as concurrent tasks. The defined parsing rules and the URLs of changing webpages taken from the database table are added to the thread task queues. In each thread task queue, the webpages corresponding to all URLs are downloaded and DOM objects are created. The DOM objects are then parsed according to the parsing rules, and the required webpage content metadata is extracted. For video-type web information, the metadata includes the video title, upload time, video type, uploader, actors, country or region, audience rating, and so on.
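The concurrent capture in step S103 could look roughly like the sketch below. Several simplifications are assumptions of this example: one shared `queue.Queue` replaces the per-thread task queues, a hypothetical `download` function returns canned markup from `FAKE_PAGES` instead of fetching over the network, and `xml.etree.ElementTree` stands in for a DOM. ElementTree only accepts well-formed markup, so real webpages would need a tolerant HTML parser. The `episode` field is also hypothetical.

```python
import queue
import threading
import xml.etree.ElementTree as ET

# Hypothetical canned pages standing in for real downloads.
FAKE_PAGES = {
    "http://example.com/series/1":
        "<html><body><h1 class='title'>Drama</h1>"
        "<span class='episode'>12</span></body></html>",
    "http://example.com/series/3":
        "<html><body><h1 class='title'>Weekly show</h1>"
        "<span class='episode'>7</span></body></html>",
}

# Parsing rule: metadata field name -> path expression into the tree.
RULE = {"title": ".//h1[@class='title']", "episode": ".//span[@class='episode']"}

def download(url):
    """Placeholder for an HTTP download (assumed helper)."""
    return FAKE_PAGES[url]

def extract(url, rule):
    """Build a tree from the page and pull out the fields named by the rule."""
    tree = ET.fromstring(download(url))
    metadata = {}
    for name, path in rule.items():
        node = tree.find(path)
        metadata[name] = node.text if node is not None else None
    return metadata

def worker(tasks, results, lock):
    """Drain the task queue; each task is a (url, rule) pair."""
    while True:
        try:
            url, rule = tasks.get_nowait()
        except queue.Empty:
            return
        metadata = extract(url, rule)
        with lock:  # guard the shared results dictionary
            results[url] = metadata

def capture(urls, rule, num_threads=2):
    tasks = queue.Queue()
    for url in urls:
        tasks.put((url, rule))
    results, lock = {}, threading.Lock()
    threads = [threading.Thread(target=worker, args=(tasks, results, lock))
               for _ in range(num_threads)]
    for thread in threads:
        thread.start()
    for thread in threads:
        thread.join()
    return results

captured = capture(sorted(FAKE_PAGES), RULE)
```

The number of threads is a parameter so that it can be scaled to the number of webpages to capture, as the step describes.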

Step S104: incrementally update the basic resource library of webpage content.

The extracted webpage content metadata is compared with the webpage content information in the basic resource library. If the basic resource library contains no record at all for the webpage URL, all content metadata under that URL is inserted directly into the basic library. If the basic library already contains part of the information under the webpage URL, the updated part of the extracted metadata is applied to the basic library as an update. At this point the incremental capture task is complete, and developers only need to schedule the incremental capture work periodically according to the update frequency and characteristics of the webpage content.
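The compare-then-insert-or-update logic of step S104 can be sketched as follows. The schema (`base_library` with `title` and `episode` columns) and the function name `incremental_update` are assumptions made for illustration.

```python
import sqlite3

COLUMNS = ("title", "episode")  # assumed metadata columns

def incremental_update(conn, url, metadata):
    """Insert a new URL, or update only the fields that changed."""
    row = conn.execute(
        "SELECT title, episode FROM base_library WHERE url = ?", (url,)
    ).fetchone()
    if row is None:
        # No record under this URL: insert all extracted metadata.
        conn.execute(
            "INSERT INTO base_library (url, title, episode) VALUES (?, ?, ?)",
            (url, metadata.get("title"), metadata.get("episode")),
        )
        return "inserted"
    stored = dict(zip(COLUMNS, row))
    changed = {col: metadata[col] for col in COLUMNS
               if col in metadata and metadata[col] != stored[col]}
    if not changed:
        return "unchanged"
    # Column names come from the fixed COLUMNS tuple, never from page data.
    assignments = ", ".join(f"{col} = ?" for col in changed)
    conn.execute(
        f"UPDATE base_library SET {assignments} WHERE url = ?",
        (*changed.values(), url),
    )
    return "updated"

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE base_library ("
    "id INTEGER PRIMARY KEY, url TEXT UNIQUE, title TEXT, episode TEXT)"
)
conn.execute(
    "INSERT INTO base_library (url, title, episode) "
    "VALUES ('http://example.com/series/1', 'Drama', '11')"
)

new_show = {"title": "New show", "episode": "1"}
actions = [
    incremental_update(conn, "http://example.com/series/1",
                       {"title": "Drama", "episode": "12"}),
    incremental_update(conn, "http://example.com/series/9", new_show),
    incremental_update(conn, "http://example.com/series/9", new_show),
]
stored_episode = conn.execute(
    "SELECT episode FROM base_library WHERE url = 'http://example.com/series/1'"
).fetchone()[0]
```

Writing only the changed columns keeps each update small, and a repeated capture of an unchanged page results in no write at all.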

Claims

1. A method for incrementally capturing webpage content, characterized by comprising the following steps:

Step 1: establish a basic resource library of webpage content;

2. The method for incrementally capturing webpage content according to claim 1, characterized in that the basic resource library of webpage content in step 1 is established as follows: starting from a seed URL, all webpages that meet the defined requirements are traversed by global search traversal; the webpages are downloaded and parsed and the webpage content metadata is extracted; the metadata is then inserted into a preset relational database, which establishes the basic resource library of webpage content.

3. The method for incrementally capturing webpage content according to claim 2, characterized in that, in the basic resource library of webpage content, each specific webpage corresponds to a uniquely identifying ID.

4. The method for incrementally capturing webpage content according to claim 1, characterized in that the webpage parsing rule is defined for a particular type of webpage, according to the page format, the structural design characteristics, the importance of the content information, user requirements, and so on.

5. The method for incrementally capturing webpage content according to claim 1, characterized in that the deployment in step 3 of a number of threads running tasks concurrently to incrementally capture webpage content metadata is carried out as follows: a number of threads is deployed according to the scale of the webpages to be captured, and the webpage parsing rules and the URLs of changing webpages taken from the database table are added to the thread task queues; in each thread task queue, the webpages corresponding to all URLs are downloaded and DOM objects are created; the DOM objects are then parsed according to the webpage parsing rules, and the required webpage content metadata is extracted.

6. The method for incrementally capturing webpage content according to claim 1, characterized in that the incremental update of the basic resource library in step 4 is carried out as follows: the webpage content metadata captured in step 3 is compared with the webpage content information in the basic resource library; if the basic resource library contains no record at all for the webpage URL, all content metadata under that URL is inserted into the basic resource library; if the basic resource library already contains part of the information under the webpage URL, the updated part of the extracted metadata is applied to the basic resource library as an update.