CN104391917A - Method for incrementally capturing webpage contents - Google Patents

Method for incrementally capturing webpage contents

Publication number
CN104391917A
Authority
CN
China
Prior art keywords
web page
page contents
webpage
basic resource
resource storehouse
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201410663266.4A
Other languages
Chinese (zh)
Inventor
康钟荣
李强
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Sichuan Changhong Electric Co Ltd
Original Assignee
Sichuan Changhong Electric Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Sichuan Changhong Electric Co Ltd
Priority to CN201410663266.4A
Publication of CN104391917A
Current legal status: Pending

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90 Details of database functions independent of the retrieved data types
    • G06F16/95 Retrieval from the web
    • G06F16/955 Retrieval from the web using information identifiers, e.g. uniform resource locators [URL]
    • G06F16/9566 URL specific, e.g. using aliases, detecting broken or misspelled links
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90 Details of database functions independent of the retrieved data types
    • G06F16/95 Retrieval from the web
    • G06F16/951 Indexing; Web crawling techniques

Abstract

The invention discloses a method for incrementally capturing webpage contents. The method comprises: establishing a webpage content basic resource library by global search traversal; extracting the URLs of webpages whose content is in a changing, updating state; adding these URLs and the corresponding webpage analysis rules to a distributed task queue; downloading the webpages and creating webpage DOM objects; extracting webpage content metadata according to the analysis rules; and, after comparing the extracted metadata with the webpage contents in the original basic library, either replacing the information that has been updated or, for URLs that do not exist in the basic resource library at all, inserting their webpage content information. When checking the whole resource library for updates, the method does not need to traverse the webpage contents of all URLs; it only needs to traverse the webpage contents under the URLs that carry an update-change identifier. Incremental capture of webpage contents is thereby realized, the capture workload is reduced and the update efficiency is improved.

Description

A method for incrementally capturing web page content
Technical field
The present invention relates to the field of computer technology, and in particular to an incremental method for capturing web page content.
Background
Crawler technology is currently in wide use for capturing web page content. Massive amounts of web information are usually gathered by a greedy, blanket-style web crawler and stored in a corresponding resource library. In an era of rapid Internet growth, however, web information is also updated very quickly. Under requirements such as timeliness and freshness, an existing web content resource library loses value and can no longer meet demand. Existing incremental capture methods typically traverse all web pages and then extract the information that is fresher than what the existing library holds. This approach is very inefficient and places heavy demands on time, storage space and network bandwidth, so an efficient incremental capture method for web page content is particularly important.
Summary of the invention
To overcome the above shortcomings of the prior art, the present invention provides a method for incrementally capturing web page content that completes the capture work efficiently and with a lower workload.
The technical solution adopted by the present invention to solve this technical problem is a method for incrementally capturing web page content, comprising the following steps:
Step 1: establish a base resource library of web page content;
Step 2: identify and filter out, from the base resource library, the web pages that are in a continuously changing and updating state, store their URLs in a database table, and define web page parsing rules;
Step 3: deploy a number of threads that run multiple tasks concurrently to incrementally capture web page content metadata;
Step 4: incrementally update the base resource library.
Compared with the prior art, the present invention has the following advantages. It first establishes the base resource library by global search traversal, then extracts the URLs of web pages whose content is in a changing, updating state and adds these URLs, together with the corresponding parsing rules, to a distributed task queue. It downloads these pages, creates a DOM object for each, and extracts web page content metadata according to the parsing rules. After comparing the metadata with the content in the original base library, it replaces the information that has been updated, or, for a URL that does not exist in the base resource library at all, inserts its web page content information. When checking the whole resource library for updates, the invention does not need to traverse the content of all URLs; it only traverses the content under the URLs that carry an update-change mark. This achieves incremental capture of web page content, reduces the capture workload and improves update efficiency.
Brief description of the drawings
The present invention is described by way of example with reference to the accompanying drawing, in which:
Fig. 1 is a flowchart of the method of the invention.
Detailed description of the embodiments
In the present invention, the method for incrementally capturing web page content starts from an established web content resource library. It extracts the URLs of the web pages that are in a changing, updating state, captures content metadata from the downloaded pages, and then either incrementally updates the corresponding content in the resource library or inserts web page content that did not exist in the original library.
In the present invention, an incremental update of the base resource library covers both updating existing web page content information and adding web page content information under brand-new URLs.
As shown in Fig. 1, the method of the invention comprises the following steps:
Step S101, establish the base resource library of web page content:
Starting from the seed URLs, all web pages that meet the defined requirements are traversed by global search traversal. These pages are downloaded and parsed according to parsing rules that match their characteristics, and the web page content metadata is extracted. The metadata is then inserted into a preset relational database to establish the base resource library, in which each specific web page corresponds to a uniquely identifying ID.
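As a rough illustration of step S101, the following sketch performs a breadth-first traversal from a seed URL and stores one row per accepted page in a relational table. The patent does not specify a schema or a traversal order, so the `pages` table, its columns, and the `fetch`, `parse` and `accept` callables are all assumptions made for this example; SQLite merely stands in for "a preset relational database".

```python
import sqlite3
from collections import deque


def build_base_library(conn, seed_url, fetch, parse, accept):
    """Breadth-first traversal from seed_url, one table row per accepted page.

    Hypothetical callables supplied by the caller:
      fetch(url)  -> (html, list_of_outgoing_urls)
      parse(html) -> dict of metadata fields
      accept(url) -> True if the page meets the defined requirements
    """
    conn.execute(
        "CREATE TABLE IF NOT EXISTS pages ("
        "id INTEGER PRIMARY KEY AUTOINCREMENT, "  # the unique ID per page
        "url TEXT UNIQUE NOT NULL, "
        "title TEXT, "
        "updating INTEGER DEFAULT 0)"  # 1 if the page keeps changing
    )
    seen, queue = {seed_url}, deque([seed_url])
    while queue:
        url = queue.popleft()
        html, links = fetch(url)
        if accept(url):
            meta = parse(html)
            conn.execute(
                "INSERT OR IGNORE INTO pages (url, title, updating) "
                "VALUES (?, ?, ?)",
                (url, meta.get("title"), int(meta.get("updating", False))),
            )
        # Enqueue every link that has not been visited yet.
        for link in links:
            if link not in seen:
                seen.add(link)
                queue.append(link)
    conn.commit()
```

Passing `fetch` in as a parameter keeps the traversal logic independent of any particular HTTP client, which also makes it easy to exercise against an in-memory fake site.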
Step S102, identify and filter out, from the base resource library, the web pages that are in a continuously changing and updating state, and define parsing rules that match the web page characteristics and requirements:
According to the characteristics and related marks of the web page content in the base resource library, the URLs of the web pages whose content is in a changing, updating state are filtered out and stored in a corresponding database table. For each page it must be determined whether its content is changing and therefore needs follow-up tracking and incremental capture. If so, the URL of the page is added to the corresponding database table, which stores only the URLs of pages that need updating. For video web pages, for example, what changes is usually the playback address of the video, as with the episodes of a serialized TV drama.
URLs are then taken from the database table that stores the URLs of changing pages, and metadata parsing rules that match the content characteristics of those pages are defined. A parsing rule is defined for a particular type of web page, according to its page format, its structural design, the importance of its content information, user requirements and so on.
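One possible shape for step S102 is sketched below: URLs flagged as changing are copied into a dedicated table, and each is paired with a rule set chosen by page type. The patent names neither tables nor a rule format, so `pages`, `watch_urls`, the `updating` and `page_type` columns, and the `PARSE_RULES` mapping are all hypothetical.

```python
import sqlite3

# Hypothetical rule sets, keyed by page type. Each rule maps a metadata
# field name to the class attribute of the element that carries it.
PARSE_RULES = {
    "video": {"title": "video-title", "uploader": "author", "score": "rating"},
}


def refresh_watch_table(conn):
    """Copy the URLs of pages flagged as changing into a dedicated table.

    Assumes a `pages` table with `url`, `page_type` and `updating` columns.
    """
    conn.execute(
        "CREATE TABLE IF NOT EXISTS watch_urls ("
        "url TEXT PRIMARY KEY, page_type TEXT NOT NULL)"
    )
    conn.execute(
        "INSERT OR IGNORE INTO watch_urls (url, page_type) "
        "SELECT url, page_type FROM pages WHERE updating = 1"
    )
    conn.commit()


def build_tasks(conn, rules):
    """Pair every watched URL with the rule set for its page type."""
    rows = conn.execute("SELECT url, page_type FROM watch_urls ORDER BY url")
    return [(url, rules[ptype]) for url, ptype in rows if ptype in rules]
```

Keeping the watch list in its own table means later crawl runs read only that table instead of scanning the whole base library.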
Step S103, deploy a number of threads that run multiple tasks concurrently to incrementally capture web page content metadata:
The number of threads deployed depends on the scale of the web pages to be captured. The defined parsing rules, and the URLs of changing pages taken from the database table, are added to the thread task queues. In each thread task queue, the pages corresponding to all URLs are downloaded and a DOM object is created for each. The DOM object is then parsed according to the parsing rules and the required web page content metadata is extracted. For video web pages, the metadata includes the video title, upload time, video type, uploader, cast, country or region, audience rating and so on.
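The concurrent capture in step S103 could look roughly like the sketch below, which uses a thread pool from the Python standard library. Two simplifications are assumptions of this example and not part of the patent: `html.parser.HTMLParser` stands in for building a full DOM object, and a rule is a plain mapping from a metadata field to the class attribute of the element holding it. The `fetch` callable is likewise hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor
from html.parser import HTMLParser


class FieldExtractor(HTMLParser):
    """Collects the text of elements whose class attribute matches a rule."""

    def __init__(self, rule):
        super().__init__()
        # Invert the rule: class attribute value -> metadata field name.
        self.by_class = {cls: field for field, cls in rule.items()}
        self.current = None
        self.fields = {}

    def handle_starttag(self, tag, attrs):
        self.current = self.by_class.get(dict(attrs).get("class"))

    def handle_endtag(self, tag):
        self.current = None

    def handle_data(self, data):
        if self.current and data.strip():
            self.fields[self.current] = data.strip()


def crawl_one(task, fetch):
    """Download one page and extract its metadata according to its rule."""
    url, rule = task
    parser = FieldExtractor(rule)
    parser.feed(fetch(url))  # fetch(url) -> HTML text (hypothetical)
    parser.close()
    return url, parser.fields


def crawl_incremental(tasks, fetch, workers=4):
    """Run all (url, rule) tasks on a pool of worker threads."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return dict(pool.map(lambda task: crawl_one(task, fetch), tasks))
```

The `workers` argument corresponds to sizing the thread count to the scale of the crawl. A production crawler would likely use a real HTML parser with selector support and handle nested elements, which this sketch ignores.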
Step S104, incrementally update the base resource library:
The extracted web page content metadata is compared with the web page content information in the base resource library. If the base resource library holds no record at all for a web page URL, all content metadata under that URL is inserted directly into the base library. If the base library already holds part of the information under the URL, the updated part of the extracted metadata is applied to the base library as an update. This completes the incremental capture task. Developers only need to schedule the incremental capture work periodically, according to how often and in what way the web page content is updated.
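The compare-then-insert-or-update logic of step S104 can be sketched as follows. The column names in `FIELDS` and the `pages` table layout are assumptions for illustration only; the patent says just that new URLs are inserted in full and that only the changed part is updated for existing URLs.

```python
import sqlite3

FIELDS = ("title", "uploader", "score")  # hypothetical metadata columns


def apply_increment(conn, url, meta):
    """Insert a new page, or update only the fields whose values changed.

    Assumes a `pages` table with a `url` key column plus the FIELDS columns.
    Returns "inserted", "updated" or "unchanged".
    """
    columns = ", ".join(FIELDS)
    row = conn.execute(
        f"SELECT {columns} FROM pages WHERE url = ?", (url,)
    ).fetchone()
    if row is None:
        # No record for this URL: insert all extracted metadata.
        marks = ", ".join("?" * (len(FIELDS) + 1))
        conn.execute(
            f"INSERT INTO pages (url, {columns}) VALUES ({marks})",
            (url, *(meta.get(f) for f in FIELDS)),
        )
        return "inserted"
    # Record exists: keep only fields that were extracted and differ.
    changed = {
        f: meta[f] for f, old in zip(FIELDS, row) if f in meta and meta[f] != old
    }
    if not changed:
        return "unchanged"
    assignments = ", ".join(f"{f} = ?" for f in changed)
    conn.execute(
        f"UPDATE pages SET {assignments} WHERE url = ?",
        (*changed.values(), url),
    )
    return "updated"
```

Fields missing from `meta` are left untouched in an existing row, so a page that exposes less metadata on a later crawl does not erase values captured earlier.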

Claims (6)

1. A method for incrementally capturing web page content, characterized in that it comprises the following steps:
Step 1: establish a base resource library of web page content;
Step 2: identify and filter out, from the base resource library, the web pages that are in a continuously changing and updating state, store their URLs in a database table, and define web page parsing rules;
Step 3: deploy a number of threads that run multiple tasks concurrently to incrementally capture web page content metadata;
Step 4: incrementally update the base resource library.
2. The method for incrementally capturing web page content according to claim 1, characterized in that the base resource library in step 1 is established as follows: starting from the seed URLs, all web pages that meet the defined requirements are traversed by global search traversal; the web page content metadata is downloaded, parsed and extracted, and then inserted into a preset relational database to establish the base resource library.
3. The method for incrementally capturing web page content according to claim 2, characterized in that, in the base resource library, each specific web page corresponds to a uniquely identifying ID.
4. The method for incrementally capturing web page content according to claim 1, characterized in that the web page parsing rules are defined for a particular type of web page, according to its page format, its structural design, the importance of its content information, user requirements and so on.
5. The method for incrementally capturing web page content according to claim 1, characterized in that the deployment in step 3 of a number of threads that run multiple tasks concurrently to incrementally capture web page content metadata is carried out as follows: a number of threads is deployed according to the scale of the web pages to be captured, and the web page parsing rules and the URLs of changing pages taken from the database table are added to the thread task queues; in each thread task queue, the pages corresponding to all URLs are downloaded and DOM objects are created, the DOM objects are then parsed according to the web page parsing rules, and the required web page content metadata is extracted from them.
6. The method for incrementally capturing web page content according to claim 1, characterized in that the incremental update of the base resource library in step 4 is carried out as follows: the web page content metadata captured in step 3 is compared with the web page content information in the base resource library; if the base resource library holds no record at all for a web page URL, all content metadata under that URL is inserted into the base resource library; if the base resource library already holds part of the information under the URL, the updated part of the extracted metadata is applied to the base resource library as an update.
CN201410663266.4A 2014-11-19 2014-11-19 Method for incrementally capturing webpage contents Pending CN104391917A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201410663266.4A CN104391917A (en) 2014-11-19 2014-11-19 Method for incrementally capturing webpage contents

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201410663266.4A CN104391917A (en) 2014-11-19 2014-11-19 Method for incrementally capturing webpage contents

Publications (1)

Publication Number Publication Date
CN104391917A true CN104391917A (en) 2015-03-04

Family

ID=52609821

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201410663266.4A Pending CN104391917A (en) 2014-11-19 2014-11-19 Method for incrementally capturing webpage contents

Country Status (1)

Country Link
CN (1) CN104391917A (en)

Citations (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2009147055A1 (en) * 2008-06-05 2009-12-10 International Business Machines Corporation Incremental crawling of multiple content providers using aggregation
US20100205168A1 (en) * 2009-02-10 2010-08-12 Microsoft Corporation Thread-Based Incremental Web Forum Crawling
CN101937469A (en) * 2010-09-15 2011-01-05 深圳市任子行网络技术股份有限公司 Information capture method of video website
CN102103636A (en) * 2011-01-18 2011-06-22 南京信息工程大学 Deep web-oriented incremental information acquisition method
CN102254027A (en) * 2011-07-29 2011-11-23 四川长虹电器股份有限公司 Method for obtaining webpage contents in batch
CN102567407A (en) * 2010-12-22 2012-07-11 北大方正集团有限公司 Method and system for collecting forum reply increment
CN102937989A (en) * 2012-10-29 2013-02-20 北京腾逸科技发展有限公司 Parallel distributed internet data capture method and system
CN103279507A (en) * 2013-05-16 2013-09-04 北京尚友通达信息技术有限公司 Webpage spider operational method and system
CN103970787A (en) * 2013-02-01 2014-08-06 北京英富森信息技术有限公司 Incremental updating and crawling technology

Cited By (12)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN105068912A (en) * 2015-07-29 2015-11-18 北京京东尚科信息技术有限公司 Method and apparatus for executing webpage task
CN105068912B (en) * 2015-07-29 2020-05-01 北京京东尚科信息技术有限公司 Method and device for executing webpage task
CN105677862A (en) * 2016-01-08 2016-06-15 上海数道信息科技有限公司 Method and device for grabbing webpage content
CN105631044A (en) * 2016-01-29 2016-06-01 四川长虹电器股份有限公司 Convergence method of network video resources
CN105893571A (en) * 2016-03-31 2016-08-24 乐视控股(北京)有限公司 Method and system for establishing content tag of video
CN107257390A (en) * 2017-05-27 2017-10-17 北京思特奇信息技术股份有限公司 A kind of parsing method and system of URL addresses
CN107257390B (en) * 2017-05-27 2020-10-09 北京思特奇信息技术股份有限公司 URL address resolution method and system
CN107547912A (en) * 2017-09-01 2018-01-05 深圳创维数字技术有限公司 A kind of method for processing resource, system and the storage medium of full matchmaker's money
CN107943869A (en) * 2017-11-10 2018-04-20 深圳市华阅文化传媒有限公司 The method and apparatus for reading third party's webpage
CN110362393A (en) * 2019-07-18 2019-10-22 北京明略软件系统有限公司 The detection method and device of increment task
CN112612774A (en) * 2020-12-17 2021-04-06 武汉达梦数据技术有限公司 Metadata analysis method and device based on page and database comparison
CN112612774B (en) * 2020-12-17 2022-07-26 武汉达梦数据技术有限公司 Metadata analysis method and device based on page and database comparison

Similar Documents

Publication Publication Date Title
CN104391917A (en) Method for incrementally capturing webpage contents
CN102737116B (en) A kind of web page resources store method and device
CN101594241B (en) Method and system for downloading network video
CN105488187A (en) Method and device for extracting multi-source heterogeneous data increment
CN105045838A (en) Network crawler system based on distributed storage system
CN101441629A (en) Automatic acquiring method of non-structured web page information
CN105631044A (en) Convergence method of network video resources
CN103714116A (en) Webpage information extracting method and webpage information extracting equipment
CN103823907A (en) Method, device and engine for integrating on-line video resource addresses
CN102254027A (en) Method for obtaining webpage contents in batch
CN102609412A (en) RSS (Really Simple Syndication)-based multi-thread graphic information synchronization crawling control method and system
CN104598536B (en) A kind of distributed network information structuring processing method
CN102141926A (en) Application program management method, device and digital television set top box
CN103365967A (en) Automatic difference detection method and device based on crawler
CN103761257A (en) Webpage handling method and system based on mobile browser
CN103970787A (en) Incremental updating and crawling technology
CN104765823A (en) Method and device for collecting website data
CN104317857A (en) House information acquisition service system
CN105069032A (en) Filtering expression and rendering engine based method for automatically monitoring update of dynamic webpage
CN104778252A (en) Index storage method and index storage device
CN103870495A (en) Method and device for extracting information from website
CN103744944A (en) Method for re-filtering in webpage or data crawling by web crawler
CN104156458A (en) Information extraction method and device
US20150269277A1 (en) Storing method and apparatus for data aquisition
CN102902791B (en) Web page classification storage system and method

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication

Application publication date: 20150304