CN104391917A - Method for incrementally capturing webpage contents - Google Patents
Method for incrementally capturing webpage contents
- Publication number
- CN104391917A (application number CN201410663266.4)
- Authority
- CN
- China
- Prior art keywords
- web page
- page contents
- webpage
- basic resource
- resource storehouse
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/90—Details of database functions independent of the retrieved data types
- G06F16/95—Retrieval from the web
- G06F16/955—Retrieval from the web using information identifiers, e.g. uniform resource locators [URL]
- G06F16/9566—URL specific, e.g. using aliases, detecting broken or misspelled links
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/90—Details of database functions independent of the retrieved data types
- G06F16/95—Retrieval from the web
- G06F16/951—Indexing; Web crawling techniques
Abstract
The invention discloses a method for incrementally capturing webpage contents. The method first establishes a basic resource library of webpage contents by global search traversal. It then extracts the URLs of webpages whose contents are in a changing, updating state, adds these URLs and the corresponding webpage parsing rules to a distributed task queue, downloads the webpages and creates webpage DOM objects, and extracts webpage content metadata according to the parsing rules. The extracted metadata is compared with the webpage contents already in the basic resource library: updated information replaces the existing information, and for URLs that do not yet exist in the library the webpage content information is inserted. When checking the whole resource library for updates, the method does not need to traverse the webpage contents of all URLs; it only traverses the webpage contents under URLs that carry an update-change identifier. Incremental capture of webpage contents is thereby realized, the capture workload is reduced, and update efficiency is improved.
Description
Technical field
The present invention relates to the field of computer technology, and in particular to a method for incrementally capturing webpage contents.
Background
At present, crawler technology is widely used to capture webpage contents. Massive amounts of webpage information are usually obtained by a greedy, blanket-style web crawling approach, and a corresponding resource library is built from the result. However, in an era of rapid Internet development, webpage information is also updated very quickly. Under requirements on indicators such as timeliness and freshness, the value of an existing webpage content resource library declines and it can no longer meet demand. Existing incremental capture methods also tend to traverse all webpages and then extract the information that is fresher than the existing resources. This approach is very inefficient and places high demands on time, storage space, and network bandwidth. An efficient method for incrementally capturing webpage contents is therefore particularly important.
Summary of the invention
To overcome the above shortcomings of the prior art, the invention provides a method for incrementally capturing webpage contents that completes the capture work efficiently and with a lower workload.
The technical solution adopted by the invention to solve the technical problem is a method for incrementally capturing webpage contents, comprising the following steps:
Step 1: establish a basic resource library of webpage contents;
Step 2: identify and filter out, from the basic resource library, the webpages that are in a continuously changing, updating state, store their URLs in a database table, and at the same time define webpage parsing rules;
Step 3: deploy a number of threads to capture webpage content metadata incrementally as concurrent tasks;
Step 4: incrementally update the basic resource library.
Compared with the prior art, the invention has the following positive effects. The invention first establishes a basic resource library of webpage contents by global search traversal. It then extracts the URLs of webpages whose contents are in a changing, updating state and adds these URLs and the corresponding webpage parsing rules to a distributed task queue. The webpages are downloaded and webpage DOM objects are created, and webpage content metadata is extracted according to the parsing rules. After comparison with the webpage contents in the original library, updated information replaces the existing information; for URLs that do not exist in the basic resource library at all, the webpage content information is inserted. When checking the whole resource library for updates, the invention does not need to traverse the webpage contents of all URLs; it only traverses the webpage contents under URLs that carry an update-change identifier. Incremental capture of webpage contents is thereby achieved, the capture workload is reduced, and update efficiency is improved.
Brief description of the drawings
The invention is described by way of example with reference to the accompanying drawing, in which:
Fig. 1 is a flowchart of the method of the invention.
Detailed description of the embodiments
In the method of the invention for incrementally capturing webpage contents, a webpage content resource library is established first. On that basis, the URLs of webpages in a changing, updating state are extracted, content metadata is captured from the downloaded webpages, and the corresponding contents in the resource library are updated incrementally; webpage contents that do not exist in the original resource library are inserted.
In the invention, the incremental update of the basic resource library covers both the update of existing webpage content information and the addition of webpage content information under entirely new URLs.
As shown in Fig. 1, the method of the invention comprises the following steps:
Step S101, establish the basic resource library of webpage contents:
First, all webpages under the seed URL that meet the defined requirements are traversed by global search traversal. These webpages are downloaded and parsed according to parsing rules that match their characteristics, and webpage content metadata is extracted. The metadata is then inserted into a preset relational database to establish the basic resource library, in which each specific webpage corresponds to a uniquely identifying ID.
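Step S101 can be illustrated with a minimal sketch. It is not the patent's implementation: the breadth-first traversal order, the in-memory `FAKE_SITE` that stands in for real downloads, the `pages` table layout, and the names `fetch_and_parse`, `build_base_library`, and `meets_requirements` are all assumptions made for illustration.

```python
import sqlite3
from collections import deque

# Hypothetical stand-in for the network: URL -> (title, outgoing links).
FAKE_SITE = {
    "http://example.com/": ("Home", ["http://example.com/a", "http://example.com/b"]),
    "http://example.com/a": ("Series A", ["http://example.com/b"]),
    "http://example.com/b": ("Film B", ["http://example.com/"]),
}


def fetch_and_parse(url):
    """Placeholder for downloading a page and applying its parsing rule."""
    return FAKE_SITE.get(url, (None, []))


def build_base_library(conn, seed_url, meets_requirements=lambda url: True):
    """Breadth-first traversal from the seed URL; one row (and one ID) per page."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS pages ("
        "id INTEGER PRIMARY KEY AUTOINCREMENT, url TEXT UNIQUE, title TEXT)"
    )
    seen, queue = {seed_url}, deque([seed_url])
    while queue:
        url = queue.popleft()
        title, links = fetch_and_parse(url)
        if title is not None and meets_requirements(url):
            conn.execute(
                "INSERT OR IGNORE INTO pages (url, title) VALUES (?, ?)", (url, title)
            )
        for link in links:
            if link not in seen:  # never enqueue a URL twice
                seen.add(link)
                queue.append(link)
    conn.commit()
    return conn.execute("SELECT COUNT(*) FROM pages").fetchone()[0]
```

The auto-incremented `id` column plays the role of the uniquely identifying ID, and the `UNIQUE` constraint on `url` keeps one record per webpage.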
Step S102, identify and filter out, from the basic resource library, the webpages that are in a continuously changing, updating state, and define parsing rules that match the webpage characteristics and requirements:
Based on the characteristics and associated identifiers of the webpage contents in the basic resource library, the URLs of webpages whose contents are in a changing, updating state are filtered out and stored in a corresponding database table. For each webpage it must be determined whether its content is still being updated and whether it requires follow-up tracking and incremental capture. If so, the URL of the webpage is added to the database table, which stores only the URLs of webpages that need updating. For video webpage content, for example, what changes is usually the playback address of the video, as with the episodes of a serialized TV drama.
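The filtering described above could look like the following sketch. The patent does not say how the changing state is recorded, so the `is_updating` flag column, the `update_urls` table, and the function name `select_updating_urls` are assumptions.

```python
import sqlite3


def select_updating_urls(conn):
    """Copy URLs of pages flagged as still changing into a dedicated table."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS update_urls ("
        "page_id INTEGER PRIMARY KEY, url TEXT NOT NULL)"
    )
    # is_updating is an assumed flag column, e.g. set for a serial still airing.
    conn.execute(
        "INSERT OR IGNORE INTO update_urls (page_id, url) "
        "SELECT id, url FROM pages WHERE is_updating = 1"
    )
    conn.commit()
    return [row[0] for row in conn.execute("SELECT url FROM update_urls ORDER BY page_id")]


conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE pages (id INTEGER PRIMARY KEY, url TEXT UNIQUE, is_updating INTEGER)"
)
conn.executemany(
    "INSERT INTO pages (url, is_updating) VALUES (?, ?)",
    [("http://example.com/series/1", 1), ("http://example.com/film/7", 0)],
)
print(select_updating_urls(conn))
```

Because the table holds only URLs that need updating, later steps can read it directly without scanning the whole library.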
URLs are taken from the database table that stores the URLs of webpages with changing content, and webpage content metadata parsing rules that match the characteristics of the webpage contents are defined. A webpage parsing rule is defined for a particular type of webpage according to the page format, the structural design characteristics, the importance of the content information, user requirements, and so on.
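One possible way to represent such parsing rules is as plain data, keyed by webpage type. The patent does not prescribe a rule format; the `ParseRule` class, the element paths, the page types, and the helper `make_tasks` below are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class ParseRule:
    """Parsing rule for one page type: metadata field -> element path."""
    page_type: str
    fields: dict


# Hypothetical rules; the paths depend entirely on the target site's layout.
RULES = {
    "video": ParseRule("video", {
        "title": ".//h1[@class='title']",
        "uploader": ".//span[@class='uploader']",
        "rating": ".//span[@class='rating']",
    }),
    "news": ParseRule("news", {
        "title": ".//h1",
        "published": ".//time",
    }),
}


def make_tasks(url_rows):
    """Pair each (url, page_type) row from the URL table with its rule."""
    return [(url, RULES[page_type]) for url, page_type in url_rows if page_type in RULES]
```

Each resulting `(url, rule)` pair is the unit of work that would be placed on a task queue in the next step.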
Step S103, deploy a number of threads to capture webpage content metadata incrementally as concurrent tasks:
A number of threads, chosen according to the scale of the webpages to be captured, are deployed to perform the incremental capture as concurrent tasks. The defined parsing rules and the URLs of changing, updating webpages taken from the database table are added to the thread task queues. In each thread task queue, the webpages corresponding to all URLs are downloaded and DOM objects are created; the DOM objects are then parsed according to the parsing rules and the required webpage content metadata is extracted. For video webpage information, the metadata includes the video title, upload time, video type, uploader, cast, country or region, audience rating, and so on.
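The concurrent capture step can be sketched with a thread pool from the Python standard library. Several simplifications are assumed here: `download` reads from an in-memory `FAKE_PAGES` dictionary instead of the network, `xml.etree.ElementTree` stands in for a DOM and only accepts well-formed markup (real HTML would need a more tolerant parser), and the rule paths and function names are hypothetical.

```python
import xml.etree.ElementTree as ET
from concurrent.futures import ThreadPoolExecutor

# Hypothetical well-formed pages standing in for downloaded HTML.
FAKE_PAGES = {
    "http://example.com/v/1": (
        "<html><body><h1 class='title'>Episode 12</h1>"
        "<span class='uploader'>alice</span></body></html>"
    ),
    "http://example.com/v/2": "<html><body><h1 class='title'>Episode 3</h1></body></html>",
}

# Hypothetical parsing rule: metadata field -> element path.
RULE = {"title": ".//h1[@class='title']", "uploader": ".//span[@class='uploader']"}


def download(url):
    """Placeholder for an HTTP download."""
    return FAKE_PAGES[url]


def crawl_one(task):
    """Download one page, build its tree, and extract the fields named by the rule."""
    url, rule = task
    dom = ET.fromstring(download(url))
    metadata = {}
    for name, path in rule.items():
        node = dom.find(path)
        metadata[name] = node.text.strip() if node is not None and node.text else None
    return url, metadata


def crawl_incrementally(tasks, workers=4):
    """Run the (url, rule) tasks concurrently; returns {url: metadata}."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return dict(pool.map(crawl_one, tasks))
```

The `workers` argument corresponds to the number of threads chosen according to the scale of the webpages to be captured; a field missing from a page is returned as `None` rather than raising an error.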
Step S104, incrementally update the basic resource library:
The extracted webpage content metadata is compared with the webpage content information in the basic resource library. If the library holds no record at all for a webpage URL, all content metadata under that URL is inserted directly into the library. If the library already holds part of the information under the URL, the changed part of the extracted metadata is applied to the library as an update. This completes the incremental capture task; the developer only needs to schedule incremental capture runs periodically according to the update frequency and characteristics of the webpage contents.
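The compare-then-insert-or-update logic of this step can be sketched as follows. The `pages` table layout, the `FIELDS` tuple of metadata columns, and the function name `apply_increment` are assumptions chosen for illustration, not details from the patent.

```python
import sqlite3

FIELDS = ("title", "rating")  # hypothetical metadata columns


def apply_increment(conn, url, metadata):
    """Insert a new page record, or update only the fields that changed."""
    row = conn.execute(
        f"SELECT {', '.join(FIELDS)} FROM pages WHERE url = ?", (url,)
    ).fetchone()
    if row is None:
        # No record under this URL: insert all extracted metadata.
        conn.execute(
            "INSERT INTO pages (url, title, rating) VALUES (?, ?, ?)",
            (url, metadata.get("title"), metadata.get("rating")),
        )
        return "inserted"
    # A record exists: keep only the fields whose values differ.
    changed = {
        name: metadata[name]
        for name, old in zip(FIELDS, row)
        if name in metadata and metadata[name] != old
    }
    if not changed:
        return "unchanged"
    assignments = ", ".join(f"{name} = ?" for name in changed)
    conn.execute(
        f"UPDATE pages SET {assignments} WHERE url = ?", (*changed.values(), url)
    )
    return "updated"
```

Column names in the SQL text come only from the fixed `FIELDS` tuple, while all values are passed as bound parameters.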
Claims (6)
1. A method for incrementally capturing webpage contents, characterized by comprising the following steps:
Step 1: establish a basic resource library of webpage contents;
Step 2: identify and filter out, from the basic resource library, the webpages that are in a continuously changing, updating state, store their URLs in a database table, and at the same time define webpage parsing rules;
Step 3: deploy a number of threads to capture webpage content metadata incrementally as concurrent tasks;
Step 4: incrementally update the basic resource library.
2. The method for incrementally capturing webpage contents according to claim 1, characterized in that the basic resource library is established in Step 1 as follows: all webpages under the seed URL that meet the defined requirements are traversed by global search traversal; the webpages are downloaded and parsed, and webpage content metadata is extracted and then inserted into a preset relational database, thereby establishing the basic resource library.
3. The method for incrementally capturing webpage contents according to claim 2, characterized in that, in the basic resource library, each specific webpage corresponds to a uniquely identifying ID.
4. The method for incrementally capturing webpage contents according to claim 1, characterized in that the webpage parsing rules are defined for a particular type of webpage according to the page format, the structural design characteristics, the importance of the content information, user requirements, and so on.
5. The method for incrementally capturing webpage contents according to claim 1, characterized in that the deployment in Step 3 of a number of threads to capture webpage content metadata incrementally as concurrent tasks is carried out as follows: a number of threads are deployed according to the scale of the webpages to be captured, and the webpage parsing rules and the URLs of changing, updating webpages taken from the database table are added to the thread task queues; in each thread task queue, the webpages corresponding to all URLs are downloaded and DOM objects are created, the DOM objects are then parsed according to the webpage parsing rules, and the required webpage content metadata is extracted.
6. The method for incrementally capturing webpage contents according to claim 1, characterized in that the incremental update of the basic resource library in Step 4 is carried out as follows: the webpage content metadata captured in Step 3 is compared with the webpage content information in the basic resource library; if the library holds no record at all for a webpage URL, all content metadata under that URL is inserted into the library; if the library already holds part of the information under the URL, the changed part of the extracted metadata is applied to the library as an update.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201410663266.4A CN104391917A (en) | 2014-11-19 | 2014-11-19 | Method for incrementally capturing webpage contents |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201410663266.4A CN104391917A (en) | 2014-11-19 | 2014-11-19 | Method for incrementally capturing webpage contents |
Publications (1)
Publication Number | Publication Date |
---|---|
CN104391917A (en) | 2015-03-04 |
Family
ID=52609821
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201410663266.4A Pending CN104391917A (en) | 2014-11-19 | 2014-11-19 | Method for incrementally capturing webpage contents |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN104391917A (en) |
Cited By (9)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN105068912A (en) * | 2015-07-29 | 2015-11-18 | 北京京东尚科信息技术有限公司 | Method and apparatus for executing webpage task |
CN105631044A (en) * | 2016-01-29 | 2016-06-01 | 四川长虹电器股份有限公司 | Convergence method of network video resources |
CN105677862A (en) * | 2016-01-08 | 2016-06-15 | 上海数道信息科技有限公司 | Method and device for grabbing webpage content |
CN105893571A (en) * | 2016-03-31 | 2016-08-24 | 乐视控股(北京)有限公司 | Method and system for establishing content tag of video |
CN107257390A (en) * | 2017-05-27 | 2017-10-17 | 北京思特奇信息技术股份有限公司 | A kind of parsing method and system of URL addresses |
CN107547912A (en) * | 2017-09-01 | 2018-01-05 | 深圳创维数字技术有限公司 | A kind of method for processing resource, system and the storage medium of full matchmaker's money |
CN107943869A (en) * | 2017-11-10 | 2018-04-20 | 深圳市华阅文化传媒有限公司 | The method and apparatus for reading third party's webpage |
CN110362393A (en) * | 2019-07-18 | 2019-10-22 | 北京明略软件系统有限公司 | The detection method and device of increment task |
CN112612774A (en) * | 2020-12-17 | 2021-04-06 | 武汉达梦数据技术有限公司 | Metadata analysis method and device based on page and database comparison |
Citations (9)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
WO2009147055A1 (en) * | 2008-06-05 | 2009-12-10 | International Business Machines Corporation | Incremental crawling of multiple content providers using aggregation |
US20100205168A1 (en) * | 2009-02-10 | 2010-08-12 | Microsoft Corporation | Thread-Based Incremental Web Forum Crawling |
CN101937469A (en) * | 2010-09-15 | 2011-01-05 | 深圳市任子行网络技术股份有限公司 | Information capture method of video website |
CN102103636A (en) * | 2011-01-18 | 2011-06-22 | 南京信息工程大学 | Deep web-oriented incremental information acquisition method |
CN102254027A (en) * | 2011-07-29 | 2011-11-23 | 四川长虹电器股份有限公司 | Method for obtaining webpage contents in batch |
CN102567407A (en) * | 2010-12-22 | 2012-07-11 | 北大方正集团有限公司 | Method and system for collecting forum reply increment |
CN102937989A (en) * | 2012-10-29 | 2013-02-20 | 北京腾逸科技发展有限公司 | Parallel distributed internet data capture method and system |
CN103279507A (en) * | 2013-05-16 | 2013-09-04 | 北京尚友通达信息技术有限公司 | Webpage spider operational method and system |
CN103970787A (en) * | 2013-02-01 | 2014-08-06 | 北京英富森信息技术有限公司 | Incremental updating and crawling technology |
Patent Citations (9)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
WO2009147055A1 (en) * | 2008-06-05 | 2009-12-10 | International Business Machines Corporation | Incremental crawling of multiple content providers using aggregation |
US20100205168A1 (en) * | 2009-02-10 | 2010-08-12 | Microsoft Corporation | Thread-Based Incremental Web Forum Crawling |
CN101937469A (en) * | 2010-09-15 | 2011-01-05 | 深圳市任子行网络技术股份有限公司 | Information capture method of video website |
CN102567407A (en) * | 2010-12-22 | 2012-07-11 | 北大方正集团有限公司 | Method and system for collecting forum reply increment |
CN102103636A (en) * | 2011-01-18 | 2011-06-22 | 南京信息工程大学 | Deep web-oriented incremental information acquisition method |
CN102254027A (en) * | 2011-07-29 | 2011-11-23 | 四川长虹电器股份有限公司 | Method for obtaining webpage contents in batch |
CN102937989A (en) * | 2012-10-29 | 2013-02-20 | 北京腾逸科技发展有限公司 | Parallel distributed internet data capture method and system |
CN103970787A (en) * | 2013-02-01 | 2014-08-06 | 北京英富森信息技术有限公司 | Incremental updating and crawling technology |
CN103279507A (en) * | 2013-05-16 | 2013-09-04 | 北京尚友通达信息技术有限公司 | Webpage spider operational method and system |
Cited By (12)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN105068912A (en) * | 2015-07-29 | 2015-11-18 | 北京京东尚科信息技术有限公司 | Method and apparatus for executing webpage task |
CN105068912B (en) * | 2015-07-29 | 2020-05-01 | 北京京东尚科信息技术有限公司 | Method and device for executing webpage task |
CN105677862A (en) * | 2016-01-08 | 2016-06-15 | 上海数道信息科技有限公司 | Method and device for grabbing webpage content |
CN105631044A (en) * | 2016-01-29 | 2016-06-01 | 四川长虹电器股份有限公司 | Convergence method of network video resources |
CN105893571A (en) * | 2016-03-31 | 2016-08-24 | 乐视控股(北京)有限公司 | Method and system for establishing content tag of video |
CN107257390A (en) * | 2017-05-27 | 2017-10-17 | 北京思特奇信息技术股份有限公司 | A kind of parsing method and system of URL addresses |
CN107257390B (en) * | 2017-05-27 | 2020-10-09 | 北京思特奇信息技术股份有限公司 | URL address resolution method and system |
CN107547912A (en) * | 2017-09-01 | 2018-01-05 | 深圳创维数字技术有限公司 | A kind of method for processing resource, system and the storage medium of full matchmaker's money |
CN107943869A (en) * | 2017-11-10 | 2018-04-20 | 深圳市华阅文化传媒有限公司 | The method and apparatus for reading third party's webpage |
CN110362393A (en) * | 2019-07-18 | 2019-10-22 | 北京明略软件系统有限公司 | The detection method and device of increment task |
CN112612774A (en) * | 2020-12-17 | 2021-04-06 | 武汉达梦数据技术有限公司 | Metadata analysis method and device based on page and database comparison |
CN112612774B (en) * | 2020-12-17 | 2022-07-26 | 武汉达梦数据技术有限公司 | Metadata analysis method and device based on page and database comparison |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN104391917A (en) | Method for incrementally capturing webpage contents | |
CN102737116B (en) | A kind of web page resources store method and device | |
CN101594241B (en) | Method and system for downloading network video | |
CN105488187A (en) | Method and device for extracting multi-source heterogeneous data increment | |
CN105045838A (en) | Network crawler system based on distributed storage system | |
CN101441629A (en) | Automatic acquiring method of non-structured web page information | |
CN105631044A (en) | Convergence method of network video resources | |
CN103714116A (en) | Webpage information extracting method and webpage information extracting equipment | |
CN103823907A (en) | Method, device and engine for integrating on-line video resource addresses | |
CN102254027A (en) | Method for obtaining webpage contents in batch | |
CN102609412A (en) | RSS (Really Simple Syndication)-based multi-thread graphic information synchronization crawling control method and system | |
CN104598536B (en) | A kind of distributed network information structuring processing method | |
CN102141926A (en) | Application program management method, device and digital television set top box | |
CN103365967A (en) | Automatic difference detection method and device based on crawler | |
CN103761257A (en) | Webpage handling method and system based on mobile browser | |
CN103970787A (en) | Incremental updating and crawling technology | |
CN104765823A (en) | Method and device for collecting website data | |
CN104317857A (en) | House information acquisition service system | |
CN105069032A (en) | Filtering expression and rendering engine based method for automatically monitoring update of dynamic webpage | |
CN104778252A (en) | Index storage method and index storage device | |
CN103870495A (en) | Method and device for extracting information from website | |
CN103744944A (en) | Method for re-filtering in webpage or data crawling by web crawler | |
CN104156458A (en) | Information extraction method and device | |
US20150269277A1 (en) | Storing method and apparatus for data aquisition | |
CN102902791B (en) | Web page classification storage system and method |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
C06 | Publication | ||
PB01 | Publication | ||
C10 | Entry into substantive examination | ||
SE01 | Entry into force of request for substantive examination | ||
RJ01 | Rejection of invention patent application after publication | | Application publication date: 20150304 |