CN102609416B - Webpage information storage control and method - Google Patents
- Publication number: CN102609416B
- Application number: CN201110023799.2A
- Authority: CN (China)
- Prior art keywords: html document, control, webpage, data, web page
- Legal status: Expired - Fee Related (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Abstract
A webpage information storage method includes: obtaining the HTML document of a specified webpage at intervals of a predetermined time; parsing the HTML document of the webpage and extracting the data in it; comparing whether the data of the parsed, newly obtained HTML document of the specified webpage is consistent with the data of the saved HTML document; and, when the two are inconsistent, replacing the data in the saved HTML document with the data in the newly obtained HTML document. The present invention also provides a control. With the method and the control, content such as the webpages, pictures, and videos of a specified website can be updated in a timely manner.
Description
Technical field
The present invention relates to a webpage information storage control and method, and in particular to a control and method by which a website dynamically obtains the latest information of a specified webpage and saves it in a timely manner.
Background
At present, a website sometimes uses an automated program, such as the Baidu spider, to visit content on the Internet such as other webpages, pictures, and videos and to build an index database, so that users can search for the webpages, pictures, videos, and other content of other websites from within that website. However, such an automated program cannot be directed to capture the webpages, pictures, videos, and other content of a specified website, and when the webpages, pictures, or videos of other websites are updated, the program does not necessarily update the content of its index database in time.
Summary of the invention
In view of this, it is necessary to provide a webpage information storage control and method that can update content such as the webpages, pictures, and videos of a specified website in a timely manner.
A webpage information storage control includes an input control, an acquisition control, a parsing control, a judgment control, and an update control. The input control provides an operation interface through which a user inputs a specified webpage address. The acquisition control periodically obtains the current HTML document of the specified webpage using the webpage address provided through the input control. The parsing control extracts the data of the current HTML document of the specified webpage obtained by the acquisition control. The judgment control compares whether the data in the newly obtained HTML document of the specified webpage is consistent with the data in the saved HTML document of the specified webpage. When the two are inconsistent, the update control updates the previously saved data of the HTML document of the specified webpage according to the data of the current HTML document extracted by the parsing control.
A webpage information storage method includes: obtaining the HTML document of the specified webpage at intervals of a predetermined time; parsing the HTML document of the specified webpage and extracting the data in it; comparing whether the data of the parsed, newly obtained HTML document of the specified webpage is consistent with the data of the saved HTML document; and, when the two are inconsistent, replacing the data in the saved HTML document with the data in the newly obtained HTML document.
The acquisition control obtains the HTML document of the specified webpage. The parsing control parses that HTML document and extracts the data in it. The judgment control compares whether the parsed current HTML document is consistent with the saved HTML document and, when they are inconsistent, the update control updates the data in the saved HTML document. Content such as the webpages, pictures, and videos of the specified website can thus be updated in a timely manner.
Brief description of the drawings
Fig. 1 is a block diagram of the webpage information storage control in an embodiment of the present invention.
Fig. 2 is a flowchart of the webpage information storage method in an embodiment of the present invention.
Description of main reference numerals
Element | Reference numeral |
---|---|
Webpage information storage control | 100 |
Input control | 10 |
Acquisition control | 20 |
Parsing control | 30 |
Judgment control | 40 |
Update control | 50 |
Detailed description of the invention
Referring to Fig. 1, a block diagram of a webpage information storage control 100 is shown. The webpage information storage control 100 is source program code placed in the program code of a website's webpage, for example in the program code of the homepage of a portal website. The webpage information storage control 100 includes an input control 10, an acquisition control 20, a parsing control 30, a judgment control 40, and an update control 50.
The input control 10 provides an input interface through which the user inputs the address of the webpage to be specified, and saves the webpage address input by the user in the URL (Uniform/Universal Resource Locator, webpage address) of the website.
The acquisition control 20 uses the specified webpage address set in the URL (Uniform/Universal Resource Locator, webpage address) of the website to obtain the HTML (HyperText Mark-up Language, hypertext markup language) document of the specified webpage at intervals of a predetermined time (for example, 2 days). Specifically, the acquisition control 20 uses the webBrowser class in .net to simulate a webpage login, and then uses the JavaScript expression document.getElementsByTagName("HTML")[0].outerHTML to obtain the HTML document of the specified webpage. The predetermined time may be a system default, or may be set by the user through the input interface provided by the input control 10.
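The periodic acquisition described above can be sketched as follows. This is a minimal illustration in Python, not the .net webBrowser approach the description names: `fetch_html` swaps in a plain HTTP download from the standard library, which does not execute page scripts or simulate a login, and the names `fetch_html` and `acquire_periodically` are hypothetical.

```python
import time
from typing import Callable, Iterator
from urllib.request import urlopen


def fetch_html(url: str, timeout: float = 10.0) -> str:
    """Download the HTML document at `url` (assumption: no login or scripts needed)."""
    with urlopen(url, timeout=timeout) as resp:
        charset = resp.headers.get_content_charset() or "utf-8"
        return resp.read().decode(charset, errors="replace")


def acquire_periodically(
    url: str,
    fetch: Callable[[str], str],
    interval_seconds: float,
    rounds: int,
    sleep: Callable[[float], None] = time.sleep,
) -> Iterator[str]:
    """Yield the HTML document of `url` once per interval, `rounds` times in total."""
    for i in range(rounds):
        if i > 0:
            # Wait the predetermined time between two acquisitions.
            sleep(interval_seconds)
        yield fetch(url)
```

With the 2-day example from the text, `interval_seconds` would be `2 * 24 * 3600`. Passing `fetch` and `sleep` as parameters keeps the loop easy to exercise without network access or real waiting.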
The parsing control 30 uses the Document object to parse the currently obtained HTML document of the specified webpage (hereinafter the "current HTML document") and the previously saved HTML document of the specified webpage (hereinafter the "saved HTML document"), and uses getElementById to obtain the data in the current HTML document and the data in the saved HTML document respectively. Every webpage contains controls, such as lists and ordinary buttons, and the data of the HTML document of the specified webpage parsed by the parsing control 30 is the data in the controls of the specified webpage.
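As a rough illustration of extracting control data by element id, the sketch below uses Python's standard `html.parser` rather than a browser Document object and getElementById. The class `ControlDataExtractor`, the function `extract_control_data`, and the choice to read the `value` attribute of void elements such as `input` are assumptions made for this example.

```python
from html.parser import HTMLParser
from typing import Dict, Iterable, List, Optional

# Elements that never have a closing tag; their data is read from `value`.
VOID_TAGS = {"input", "img", "br", "hr", "meta", "link"}


class ControlDataExtractor(HTMLParser):
    """Collect the data of the elements whose id is in `wanted_ids`."""

    def __init__(self, wanted_ids: Iterable[str]) -> None:
        super().__init__(convert_charrefs=True)
        self.wanted = set(wanted_ids)
        self.data: Dict[str, str] = {}
        self._open_id: Optional[str] = None   # id of the element being captured
        self._open_tag: Optional[str] = None  # its tag name
        self._depth = 0                       # nesting depth of that tag name
        self._chunks: List[str] = []

    def handle_starttag(self, tag, attrs):
        attr = dict(attrs)
        if self._open_id is None:
            el_id = attr.get("id")
            if el_id in self.wanted:
                if tag in VOID_TAGS:
                    self.data[el_id] = attr.get("value") or ""
                else:
                    self._open_id, self._open_tag = el_id, tag
                    self._depth, self._chunks = 1, []
        elif tag == self._open_tag:
            self._depth += 1

    def handle_endtag(self, tag):
        if self._open_id is not None and tag == self._open_tag:
            self._depth -= 1
            if self._depth == 0:
                self.data[self._open_id] = "".join(self._chunks).strip()
                self._open_id = self._open_tag = None

    def handle_data(self, data):
        if self._open_id is not None:
            self._chunks.append(data)


def extract_control_data(html: str, control_ids: Iterable[str]) -> Dict[str, str]:
    """Return a mapping of control id to its text (or `value`) in `html`."""
    parser = ControlDataExtractor(control_ids)
    parser.feed(html)
    parser.close()
    return parser.data
```

Running the same function on the current and the saved HTML document yields two dictionaries that can be compared control by control.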
When the acquisition control 20 obtains a new HTML document of the specified webpage, the judgment control 40 compares whether the data in the related controls of the current HTML document is consistent with the data in the related controls of the saved HTML document.
When the data in the related controls of the current HTML document is inconsistent with the data in the related controls of the saved HTML document, the update control 50 replaces the data in the related controls of the originally saved HTML document with the data in the related controls of the current HTML document, and saves the replaced data.
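The compare-and-replace step can be sketched as below, assuming the control data has already been extracted into dictionaries keyed by control id. The function name `update_saved_data` is hypothetical, and the handling of controls that disappear from the current document is a choice made here, since the description does not address it.

```python
from typing import Dict, List


def update_saved_data(saved: Dict[str, str], current: Dict[str, str]) -> List[str]:
    """Replace saved control data that differs from the current data.

    Returns the ids of the controls whose saved data was replaced.
    Controls present only in `saved` are left untouched (assumption).
    """
    # Judgment: find controls whose current data is inconsistent with the saved data.
    changed = [cid for cid, value in current.items() if saved.get(cid) != value]
    # Update: overwrite only those entries, then report what changed.
    for cid in changed:
        saved[cid] = current[cid]
    return changed
```

An empty return value means the two documents were consistent and nothing was rewritten.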
The judgment control 40 is also used to judge whether the obtained HTML document of the specified webpage is being obtained for the first time. When the current HTML document is obtained for the first time, the update control 50 saves the HTML document. When the current HTML document is not obtained for the first time, the parsing control 30 parses the HTML document of the specified webpage.
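One way to realize the first-acquisition check is to treat the absence of a saved copy as "first time". The sketch below assumes the saved HTML document lives in a single file; `SnapshotStore` and `handle_acquired` are hypothetical names, and the description does not say where the document is actually stored.

```python
from pathlib import Path
from typing import Optional


class SnapshotStore:
    """Keep the saved HTML document of one specified webpage in a file."""

    def __init__(self, path: Path) -> None:
        self.path = path

    def load(self) -> Optional[str]:
        """Return the saved HTML document, or None if nothing has been saved yet."""
        if not self.path.exists():
            return None
        return self.path.read_text(encoding="utf-8")

    def save(self, html: str) -> None:
        self.path.write_text(html, encoding="utf-8")


def handle_acquired(store: SnapshotStore, html: str) -> str:
    """Save the document on first acquisition; otherwise hand it on for parsing."""
    if store.load() is None:
        store.save(html)   # first acquisition: save directly
        return "saved"
    return "parse"         # later acquisitions: parse and compare
```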
Referring to Fig. 2, a flowchart of the webpage information storage method in an embodiment of the present invention is shown.
In step S201, the acquisition control 20 periodically obtains the HTML document of the specified webpage using the webpage address input through the input control 10.
In step S202, the judgment control 40 judges whether the current HTML document is being obtained for the first time. When the current HTML document is obtained for the first time, step S206 is performed; when it is not, step S203 is performed.
In step S203, the parsing control 30 uses the Document object to parse the current HTML document and the saved HTML document, thereby obtaining the data in the related controls of the current HTML document and the data in the related controls of the saved HTML document respectively.
In step S204, when the acquisition control 20 obtains a new HTML document of the specified webpage, the judgment control 40 compares whether the data in the related controls of the current HTML document is consistent with the data in the related controls of the saved HTML document. When the two are inconsistent, step S205 is performed.
In step S205, the update control 50 replaces the data in the related controls of the saved HTML document with the data in the related controls of the current HTML document, and saves the replaced data.
In step S206, the update control 50 saves the HTML document.
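Steps S201 to S206 can be tied together in one pass, sketched below under two stated simplifications: the fetch and extraction routines are passed in as parameters, and step S205 replaces the whole saved document with the current one instead of patching individual controls. The function name `run_once` and the returned step labels are hypothetical.

```python
from typing import Callable, Dict, Optional, Tuple


def run_once(
    url: str,
    saved_html: Optional[str],
    fetch: Callable[[str], str],
    extract: Callable[[str], Dict[str, str]],
) -> Tuple[str, str]:
    """Run one acquisition cycle; return (HTML document to keep saved, step taken)."""
    current_html = fetch(url)                  # S201: obtain the current document
    if saved_html is None:                     # S202: first acquisition?
        return current_html, "S206"            # S206: save the document as-is
    current_data = extract(current_html)       # S203: parse both documents
    saved_data = extract(saved_html)
    if current_data != saved_data:             # S204: compare control data
        return current_html, "S205"            # S205 (simplified): replace saved copy
    return saved_html, "unchanged"
```

Because the comparison is made on the extracted control data rather than the raw markup, changes outside the related controls do not trigger an update.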
Those skilled in the art should appreciate that the above embodiments are intended only to illustrate the present invention and not to limit it. Appropriate changes and variations made to the above embodiments fall within the scope of protection claimed by the present invention as long as they remain within its essential spirit.
Claims (7)
1. A webpage information storage control, characterized in that: the control includes an input control, an acquisition control, a parsing control, a judgment control, and an update control; the input control is used to provide an operation interface through which a user inputs a specified webpage address; the acquisition control is used to periodically obtain the current HTML document of the specified webpage using the specified webpage address provided through the input control; the parsing control is used to extract the data of the current HTML document of the specified webpage obtained by the acquisition control; the judgment control is used to compare whether the data in the obtained HTML document of the specified webpage is consistent with the data in the saved HTML document of the specified webpage; and, when the two are inconsistent, the update control is used to update the previously saved data of the HTML document of the specified webpage according to the data of the current HTML document extracted by the parsing control.
2. The webpage information storage control as claimed in claim 1, characterized in that: the judgment control is also used to judge whether the HTML document of the webpage is being obtained for the first time; when the HTML document of the webpage is obtained for the first time, the update control directly saves the HTML document; when it is not obtained for the first time, the parsing control parses the data in the HTML document of the specified webpage.
3. The webpage information storage control as claimed in claim 1, characterized in that: the parsing control uses the Document object to extract the related data in the specified webpage.
4. The webpage information storage control as claimed in claim 1, characterized in that: the control is program code, and the program code is located in the program of the webpage.
5. A webpage information storage method, characterized in that the method includes:
obtaining the HTML document of the webpage at intervals of a predetermined time;
parsing the HTML document of the webpage and extracting the data in it;
comparing whether the data of the parsed, obtained HTML document of the specified webpage is consistent with the data of the saved HTML document;
when the data of the parsed, obtained HTML document of the specified webpage is inconsistent with the data of the saved HTML document, replacing the data in the saved HTML document with the data in the obtained HTML document.
6. The webpage information storage method as claimed in claim 5, characterized in that the method also includes:
judging whether the HTML document of the specified webpage is being obtained for the first time;
when the HTML document of the specified webpage is obtained for the first time, saving the obtained HTML document of the specified webpage;
when the HTML document of the specified webpage is not obtained for the first time, parsing the data in the obtained and the saved HTML documents of the specified webpage.
7. The webpage information storage method as claimed in claim 5, characterized in that: the data in the HTML document of the webpage is extracted by using the Document object.
Priority Applications (3)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201110023799.2A CN102609416B (en) | 2011-01-21 | Webpage information storage control and method | |
TW100108520A TWI494781B (en) | 2011-01-21 | 2011-03-14 | Activex capable of saving the information of the webpage and method thereof |
US13/076,463 US20120192060A1 (en) | 2011-01-21 | 2011-03-31 | System and method for updating html documents in an html document updating device |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201110023799.2A CN102609416B (en) | 2011-01-21 | Webpage information storage control and method |
Publications (2)
Publication Number | Publication Date |
---|---|
CN102609416A CN102609416A (en) | 2012-07-25 |
CN102609416B true CN102609416B (en) | 2016-12-14 |
Citations (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN101178736A (en) * | 2007-12-11 | 2008-05-14 | 腾讯科技(深圳)有限公司 | Web page collecting method and web page collecting server |
CN101582075A (en) * | 2009-06-24 | 2009-11-18 | 大连海事大学 | Web information extraction system |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
US8612420B2 (en) | Configuring web crawler to extract web page information | |
US8762556B2 (en) | Displaying content on a mobile device | |
CN101427229B (en) | Technique for modifying presentation of information displayed to end users of a computer system | |
TWI322950B (en) | ||
US20140317482A1 (en) | Client side page processing | |
US20150067476A1 (en) | Title and body extraction from web page | |
US20110087966A1 (en) | Internet customization system | |
CN101042694B (en) | Method for accessing father page in the time of brewing web page | |
US20120317472A1 (en) | Creation of data extraction rules to facilitate web scraping of unstructured data from web pages | |
US10853319B2 (en) | System and method for display of document comparisons on a remote device | |
US20210042466A1 (en) | Detecting compatible layouts for content-based native ads | |
CN106354484A (en) | Browser compatibility method and browser | |
US20220114269A1 (en) | Page processing method, electronic apparatus and non-transitory computer-readable storage medium | |
CN104268282A (en) | Web banner advertisement displaying method and system | |
KR101402146B1 (en) | Method for scraping web screen in mobile device and mobile device providing web screen scraping | |
CN105204806A (en) | Individual display method and device for mobile terminal webpage | |
US9817801B2 (en) | Website content and SEO modifications via a web browser for native and third party hosted websites | |
US20120310893A1 (en) | Systems and methods for manipulating and archiving web content | |
CN103246680B (en) | A kind of method in browser, web page contents polymerization being represented and device | |
KR102290380B1 (en) | Page construction method, apparatus, device and non-volatile computer storage medium | |
JP5216654B2 (en) | Importance determination device, importance determination method, and program | |
CN102609416B (en) | Webpage information storage control and method | |
US10163118B2 (en) | Method and apparatus for associating user engagement data received from a user with portions of a webpage visited by the user | |
TWI494781B (en) | Activex capable of saving the information of the webpage and method thereof | |
WO2014027237A1 (en) | Systems and methods for web localization |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||
CF01 | Termination of patent right due to non-payment of annual fee |
Granted publication date: 20161214 Termination date: 20180121 |