CN102609416B - Webpage information storage control and method - Google Patents


Publication number
CN102609416B
Authority
CN
China
Prior art keywords
html document
control
webpage
data
web page
Legal status
Expired - Fee Related
Application number
CN201110023799.2A
Other languages
Chinese (zh)
Other versions
CN102609416A (en)
Inventor
翁世芳
陆欣
刘耀华
吴云艳
林希
Current Assignee
Shenzhen Yuzhan Precision Technology Co ltd
Hon Hai Precision Industry Co Ltd
Original Assignee
Shenzhen Yuzhan Precision Technology Co ltd
Hon Hai Precision Industry Co Ltd
Filing date
2011-01-21
Publication date
2016-12-14
Application filed by Shenzhen Yuzhan Precision Technology Co ltd, Hon Hai Precision Industry Co Ltd
Priority to CN201110023799.2A (CN102609416B)
Priority to TW100108520A (TWI494781B)
Priority to US13/076,463 (US20120192060A1)
Publication of CN102609416A
Application granted
Publication of CN102609416B

Abstract

A webpage information storage method includes: obtaining the HTML document of a specified webpage at predetermined intervals; parsing the HTML document of the webpage and extracting the data in it; comparing the data of the newly obtained and parsed HTML document of the specified webpage with the data of the saved HTML document to determine whether they are consistent; and, when they are inconsistent, replacing the data in the saved HTML document with the data in the newly obtained HTML document. The present invention also provides a control. With the method and the control, content such as the webpages, pictures, and videos of a specified website can be updated in a timely manner.

Description

Webpage information storage control and method
Technical field
The present invention relates to a webpage information storage control and method, and in particular to a control and method by which a website dynamically obtains the latest information of a specified webpage and saves it in a timely manner.
Background art
At present, an automated program of a website, such as Baidu Spider, is sometimes used to visit content such as other webpages, pictures, and videos on the Internet and to build an index database, so that users can search that website for the webpages, pictures, videos, and other content of other websites. However, such an automated program cannot be directed to capture the webpages, pictures, videos, and other content of a specified website, and when the webpages, pictures, videos, and so on of other websites are updated, the program does not necessarily update the content in its index database in a timely manner.
Summary of the invention
In view of this, it is necessary to provide a webpage information storage control and method that can update content such as the webpages, pictures, and videos of a specified website in a timely manner.
A webpage information storage control includes an input control, an acquisition control, a parsing control, a judgment control, and an update control. The input control provides an operation interface through which a user inputs a specified web page address. The acquisition control periodically obtains the current HTML document of the specified webpage using the web page address provided through the input control. The parsing control extracts the data of the current HTML document of the specified webpage obtained by the acquisition control. The judgment control compares the data in the obtained and parsed HTML document of the specified webpage with the data in the saved HTML document to determine whether they are consistent. When the data in the obtained HTML document and the saved HTML document of the specified webpage are inconsistent, the update control updates the data of the previously saved HTML document of the specified webpage according to the data that the parsing control extracted from the current HTML document.
A webpage information storage method includes: obtaining the HTML document of the specified webpage at predetermined intervals; parsing the HTML document of the specified webpage and extracting the data in it; comparing the data of the obtained and parsed HTML document of the specified webpage with the data of the saved HTML document to determine whether they are consistent; and, when they are inconsistent, replacing the data in the saved HTML document with the data in the obtained HTML document.
The acquisition control obtains the HTML document of the specified webpage. The parsing control parses the HTML document of the specified webpage and extracts the data in it. The judgment control compares the parsed current HTML document with the saved HTML document to determine whether they are consistent, and when they are inconsistent, the update control updates the data in the saved HTML document. Content such as the webpages, pictures, and videos of the specified website can therefore be updated in a timely manner.
Brief description of the drawings
Fig. 1 is a block diagram of a webpage information storage control in an embodiment of the present invention.
Fig. 2 is a flowchart of a webpage information storage method in an embodiment of the present invention.
Description of main reference numerals
Webpage information storage control 100
Input control 10
Acquisition control 20
Parsing control 30
Judgment control 40
Update control 50
Detailed description
Refer to Fig. 1, which is a block diagram of a webpage information storage control 100. The webpage information storage control 100 is source program code that is placed in the program code of a website's webpage, for example in the program code of the homepage of a portal website. The webpage information storage control 100 includes an input control 10, an acquisition control 20, a parsing control 30, a judgment control 40, and an update control 50.
The input control 10 provides an input interface in which the user enters the address of the webpage to be specified, and saves the web page address entered by the user in the URL (Uniform/Universal Resource Locator, web page address) of the website.
The acquisition control 20 uses the specified web page address set in the URL (Uniform/Universal Resource Locator, web page address) of the website to obtain the HTML (HyperText Markup Language) document of the specified webpage at predetermined intervals (for example, every 2 days). Specifically, the acquisition control 20 uses the WebBrowser class in .NET to simulate logging in to the webpage, and then uses the JavaScript expression document.getElementsByTagName("HTML")[0].outerHTML to obtain the HTML document of the specified webpage. The predetermined interval may be a system default, or the user may set it through the input interface provided by the input control 10.
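The periodic acquisition described above can be sketched roughly as follows. This is a minimal Python illustration, not the .NET and JavaScript implementation the description refers to: it swaps the WebBrowser-based retrieval for a plain HTTP request through the standard library's `urllib.request`, so content that a page builds with scripts would not be captured. The names `fetch_html`, `poll`, and the `rounds` parameter are hypothetical and exist only for this sketch.

```python
import time
import urllib.request
from typing import Callable, Iterator

TWO_DAYS_SECONDS = 2 * 24 * 60 * 60  # the "2 days" example interval from the text


def fetch_html(url: str, timeout: float = 30.0) -> str:
    """Download the HTML document at `url` (plain HTTP request, no script execution)."""
    with urllib.request.urlopen(url, timeout=timeout) as response:
        charset = response.headers.get_content_charset() or "utf-8"
        return response.read().decode(charset, errors="replace")


def poll(url: str,
         interval_seconds: float = TWO_DAYS_SECONDS,
         rounds: int = 1,
         fetch: Callable[[str], str] = fetch_html,
         sleep: Callable[[float], None] = time.sleep) -> Iterator[str]:
    """Yield the page's current HTML `rounds` times, waiting between fetches.

    `fetch` and `sleep` are injectable (an assumption of this sketch) so the
    loop can be exercised without network access or real delays.
    """
    for i in range(rounds):
        if i > 0:
            sleep(interval_seconds)  # wait the predetermined interval
        yield fetch(url)
```

Each yielded string plays the role of the "current HTML document" that the later parsing and comparison steps work on.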
The parsing control 30 uses the Document object to parse the currently obtained HTML document of the specified webpage (hereinafter the "current HTML document") and the previously saved HTML document of the specified webpage (hereinafter the "saved HTML document"), and uses getElementById to obtain the data in the current HTML document and the data in the saved HTML document, respectively. Every webpage includes controls, such as lists and ordinary buttons; the data of the HTML document that the parsing control 30 parses is the data in the controls of the specified webpage.
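As a rough illustration of extracting the data held in a page's controls, the following Python sketch uses the standard library's `html.parser` in place of the browser Document object and getElementById named in the description. Which tags count as "controls" (`CONTROL_TAGS`) and the names `ControlExtractor` and `extract_control_data` are assumptions made for this sketch only.

```python
from html.parser import HTMLParser
from typing import Dict, Optional

# Assumed set of tags treated as "controls"; the description does not list them.
CONTROL_TAGS = {"input", "button", "select", "textarea"}


class ControlExtractor(HTMLParser):
    """Collect the value or text of control elements, keyed by their id attribute."""

    def __init__(self) -> None:
        super().__init__()
        self.data: Dict[str, str] = {}
        self._open_id: Optional[str] = None  # control whose text is being collected

    def handle_starttag(self, tag, attrs):
        if tag not in CONTROL_TAGS:
            return
        attributes = dict(attrs)
        control_id = attributes.get("id")
        if control_id is None:
            return  # controls without an id cannot be looked up by id
        self.data[control_id] = attributes.get("value") or ""
        if tag != "input":  # <input> has no text content to collect
            self._open_id = control_id

    def handle_endtag(self, tag):
        if tag in CONTROL_TAGS:
            self._open_id = None

    def handle_data(self, data):
        if self._open_id is not None:
            self.data[self._open_id] += data.strip()


def extract_control_data(html: str) -> Dict[str, str]:
    """Return a mapping of control id to its value or text for one HTML document."""
    parser = ControlExtractor()
    parser.feed(html)
    parser.close()
    return parser.data
```

Running the same function over the current and the saved HTML document gives two mappings that can then be compared control by control.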
When the acquisition control 20 obtains a new HTML document of the specified webpage, the judgment control 40 compares the data in the related controls of the current HTML document with the data in the related controls of the saved HTML document to determine whether they are consistent.
When the data in the related controls of the current HTML document and the data in the related controls of the saved HTML document are inconsistent, the update control 50 replaces the data in the related controls of the originally saved HTML document with the data in the related controls of the current HTML document, and saves the replacement data.
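The compare-then-replace behavior of the judgment and update controls might look like the following Python sketch. It assumes the control data has already been extracted into dictionaries keyed by control id; that representation and the function names `changed_controls` and `replace_if_changed` are hypothetical.

```python
from typing import Dict, Set, Tuple


def changed_controls(saved: Dict[str, str], current: Dict[str, str]) -> Set[str]:
    """Return the ids of controls whose data differs between the two documents."""
    all_ids = saved.keys() | current.keys()
    return {cid for cid in all_ids if saved.get(cid) != current.get(cid)}


def replace_if_changed(saved: Dict[str, str],
                       current: Dict[str, str]) -> Tuple[Dict[str, str], Set[str]]:
    """Keep the saved data when consistent; otherwise replace it with the current data."""
    changed = changed_controls(saved, current)
    if not changed:
        return saved, changed          # consistent: nothing to update
    return dict(current), changed      # inconsistent: current data replaces saved data
```

Returning the set of changed ids alongside the data is a choice of this sketch; it makes it easy to log or inspect what was updated.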
The judgment control 40 also determines whether the obtained HTML document of the specified webpage has been obtained for the first time. When the current HTML document has been obtained for the first time, the update control 50 saves the HTML document. When the current HTML document has not been obtained for the first time, the parsing control 30 parses the HTML document of the specified webpage.
Refer to Fig. 2, which is a flowchart of the webpage information storage method in an embodiment of the present invention.
In step S201, the acquisition control 20 periodically obtains the HTML document of the specified webpage using the web page address entered in the input control 10.
In step S202, the judgment control 40 determines whether the current HTML document has been obtained for the first time. When the current HTML document has been obtained for the first time, step S206 is performed; when it has not, step S203 is performed.
In step S203, the parsing control 30 uses the Document object to parse the current HTML document and the saved HTML document, and thereby obtains the data in the related controls of the current HTML document and the data in the related controls of the saved HTML document, respectively.
In step S204, when the acquisition control 20 obtains a new HTML document of the specified webpage, the judgment control 40 compares the data in the related controls of the current HTML document with the data in the related controls of the saved HTML document to determine whether they are consistent. When the two are inconsistent, step S205 is performed.
In step S205, the update control 50 replaces the data in the related controls of the saved HTML document with the data in the related controls of the current HTML document, and saves the replacement data.
In step S206, the update control 50 saves the HTML document.
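Steps S202 through S206 can be tied together in a short Python sketch. It is an illustration under stated assumptions rather than the patented implementation: the saved document is held in memory, the parsing step is supplied by the caller as an `extract` function, and on a mismatch the whole saved document is replaced instead of individual control data. The class name `PageSaver` and its method `handle` are hypothetical.

```python
from typing import Callable, Dict, Optional


class PageSaver:
    """In-memory stand-in for the saved HTML document of one specified webpage."""

    def __init__(self, extract: Callable[[str], Dict[str, str]]) -> None:
        self.extract = extract                   # parsing step (S203), caller-supplied
        self.saved_html: Optional[str] = None    # nothing saved before the first fetch

    def handle(self, current_html: str) -> str:
        """Process one freshly obtained document (the result of S201)."""
        # S202: has a document been obtained before?
        if self.saved_html is None:
            self.saved_html = current_html       # S206: first fetch, save as-is
            return "saved"
        # S203: parse both the current and the saved document
        current_data = self.extract(current_html)
        saved_data = self.extract(self.saved_html)
        # S204: compare the extracted data
        if current_data == saved_data:
            return "unchanged"
        # S205: replace the saved content with the current content
        self.saved_html = current_html
        return "updated"
```

A caller would create one `PageSaver` per specified webpage and pass each periodically fetched document to `handle`; the returned string only reports which branch of the flow was taken.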
Those skilled in the art should appreciate that the above embodiments are intended only to illustrate the present invention and not to limit it. Appropriate changes and variations made to the above embodiments that remain within the spirit of the present invention fall within the scope of protection claimed by the present invention.

Claims (7)

1. A webpage information storage control, characterized in that: the control includes an input control, an acquisition control, a parsing control, a judgment control, and an update control; the input control provides an operation interface through which a user inputs a specified web page address; the acquisition control periodically obtains the current HTML document of the specified webpage using the specified web page address provided through the input control; the parsing control extracts the data of the current HTML document of the specified webpage obtained by the acquisition control; the judgment control compares the data in the obtained and parsed HTML document of the specified webpage with the data in the saved HTML document of the specified webpage to determine whether they are consistent; and when the data in the obtained HTML document and the saved HTML document of the specified webpage are inconsistent, the update control updates the data of the previously saved HTML document of the specified webpage according to the data of the current HTML document extracted by the parsing control.
2. The webpage information storage control as claimed in claim 1, characterized in that: the judgment control also determines whether the HTML document of the webpage has been obtained for the first time; when the HTML document of the webpage has been obtained for the first time, the update control saves the HTML document directly; and when the HTML document of the webpage has not been obtained for the first time, the parsing control parses the data in the HTML document of the specified webpage.
3. The webpage information storage control as claimed in claim 1, characterized in that: the parsing control uses the Document object to extract the related data in the specified webpage.
4. The webpage information storage control as claimed in claim 1, characterized in that: the control is program code, and the program code is located in the program of the webpage.
5. A webpage information storage method, characterized in that the method includes:
obtaining the HTML document of the webpage at predetermined intervals;
parsing the HTML document of the webpage and extracting the data in the HTML document of the webpage;
comparing the data of the obtained and parsed HTML document of the specified webpage with the data of the saved HTML document to determine whether they are consistent; and
when the data of the obtained and parsed HTML document of the specified webpage and the data of the saved HTML document are inconsistent, replacing the data in the saved HTML document with the data in the obtained HTML document.
6. The webpage information storage method as claimed in claim 5, characterized in that the method also includes:
determining whether the HTML document of the specified webpage has been obtained for the first time;
when the HTML document of the specified webpage has been obtained for the first time, saving the obtained HTML document of the specified webpage; and
when the HTML document of the specified webpage has not been obtained for the first time, parsing the data in the obtained HTML document and in the saved HTML document of the specified webpage.
7. The webpage information storage method as claimed in claim 5, characterized in that: the data in the HTML document of the webpage is extracted using the Document object.
CN201110023799.2A 2011-01-21 2011-01-21 Webpage information storage control and method Expired - Fee Related CN102609416B (en)

Priority Applications (3)

Application Number Priority Date Filing Date Title
CN201110023799.2A CN102609416B (en) 2011-01-21 Webpage information storage control and method
TW100108520A TWI494781B (en) 2011-01-21 2011-03-14 Activex capable of saving the information of the webpage and method thereof
US13/076,463 US20120192060A1 (en) 2011-01-21 2011-03-31 System and method for updating html documents in an html document updating device

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201110023799.2A CN102609416B (en) 2011-01-21 Webpage information storage control and method

Publications (2)

Publication Number Publication Date
CN102609416A (en) 2012-07-25
CN102609416B (en) 2016-12-14

Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101178736A (en) * 2007-12-11 2008-05-14 腾讯科技(深圳)有限公司 Web page collecting method and web page collecting server
CN101582075A (en) * 2009-06-24 2009-11-18 大连海事大学 Web information extraction system

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant
CF01 Termination of patent right due to non-payment of annual fee

Granted publication date: 20161214

Termination date: 20180121