CN102103622A - Web site rebuilding system and method - Google Patents
- Publication number
- CN102103622A
- Authority
- CN
- China
- Prior art keywords
- data
- webpage
- website
- hierarchical
- module
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Abstract
The invention discloses a web site rebuilding system. The system comprises a downloading module, a non-web page processing module, a filter module, a judgment module, a hierarchical data processing module, a web page text processing module and a reconstruction module, wherein the downloading module is used for downloading data of a target web site; the non-web page processing module is used for storing the downloaded data when it is non-web page data; the filter module is used for filtering the downloaded data when it is a web page; the judgment module is used for judging whether the filtered web page contains links pointing to other data of the target web site, hierarchical data and web page text; the hierarchical data processing module is used for resolving the hierarchical relationship of the hierarchical data, storing the hierarchical data according to the hierarchical relationship, and judging whether the leaf nodes of the hierarchical data contain links pointing to other data of the target web site; the web page text processing module is used for storing the web page text; and the reconstruction module is used for forming a local web site from the stored non-web page data, hierarchical data and web page text according to the original structure. The invention also provides a web site rebuilding method. The system and the method can rebuild a remote web site as a local site.
Description
Technical field
The present invention relates to a system and method for downloading website data, and in particular to a website rebuilding system and method.
Background technology
With the growing popularity and rapid development of the Internet, people depend on networks more and more. For websites closely related to their business, users want to set up a corresponding local website that preserves the site's information so that it can be used when needed.
Summary of the invention
In view of the above, it is necessary to provide a website rebuilding system that can rebuild a remote website locally.
It is also necessary to provide a website rebuilding method that can rebuild a remote website locally.
A website rebuilding system runs on a local server that is communicatively connected to a remote server hosting a target website. The system comprises: a download module for downloading data of the target website from the remote server; a non-webpage processing module for storing the downloaded data when it is non-webpage data; a filtering module for filtering the downloaded data when it is a webpage, so as to remove invalid data from the webpage; a judgment module for determining whether the filtered webpage contains links pointing to other data of the target website, hierarchical data and webpage text, wherein for each link in the filtered webpage that points to other data of the target website, the download module downloads the data the link points to from the remote server; a hierarchical data processing module for, with respect to hierarchical data in the filtered webpage, parsing the hierarchical relationship of the hierarchical data, obtaining its leaf nodes, storing the hierarchical data according to the hierarchical relationship, and determining whether the leaf nodes contain links pointing to other data of the target website, wherein for each such link in a leaf node, the download module downloads the data the link points to from the remote server; a webpage text processing module for storing the webpage text of the filtered webpage; and a reconstruction module for forming the stored non-webpage data, hierarchical data and webpage text into a local website according to the original structure.
A website rebuilding method comprises: a download step of downloading data of the target website; a non-webpage processing step of storing the downloaded data when it is non-webpage data; a filtering step of filtering the downloaded data when it is a webpage, so as to remove invalid data from the webpage; a judgment step of determining whether the filtered webpage contains links pointing to other data of the target website, hierarchical data and webpage text; a link processing step of, for each link in the filtered webpage that points to other data of the target website, returning to the download step and downloading the data the link points to from the remote server; a hierarchical data processing step of, with respect to hierarchical data in the filtered webpage, parsing the hierarchical relationship of the hierarchical data, obtaining its leaf nodes, storing the hierarchical data according to the hierarchical relationship, determining whether the leaf nodes contain links pointing to other data of the target website, and, for each such link in a leaf node, returning to the download step and downloading the data the link points to from the remote server; a webpage text processing step of storing the webpage text of the filtered webpage; and a reconstruction step of forming the stored non-webpage data, hierarchical data and webpage text into a local website according to the original structure.
The present invention downloads the data of the target website, saves it locally according to its original structure, and forms a local website according to that structure.
Description of drawings
Fig. 1 is a schematic diagram of the application environment of a preferred embodiment of the website rebuilding system of the present invention.
Fig. 2 is a functional block diagram of the website rebuilding system of Fig. 1.
Fig. 3 is a flowchart of a preferred embodiment of the website rebuilding method of the present invention.
Description of main reference numerals
Website rebuilding system | 10 |
Local server | 11 |
Remote server | 12 |
External network | 13 |
Target website | 14 |
Client | 15 |
Internal network | 16 |
Database | 17 |
Download module | 200 |
Non-webpage processing module | 210 |
Filtering module | 220 |
Judgment module | 230 |
Hierarchical data processing module | 240 |
Webpage text processing module | 250 |
Reconstruction module | 260 |
Embodiment
Referring to Fig. 1, a schematic diagram of the application environment of a preferred embodiment of the website rebuilding system of the present invention is shown. The website rebuilding system 10 runs on a local server 11. The local server 11 is communicatively connected to a remote server 12 through an external network 13 (for example, the Internet). The remote server 12 hosts a target website 14, whose data includes webpages (for example, webpages in HTML format) and non-webpage data (for example, files in exe, doc or zip format). The local server 11 is also communicatively connected to a plurality of clients 15 through an internal network 16. The website rebuilding system 10 downloads the data of the target website 14 from the remote server 12, stores it according to its original structure in a database 17 connected to the local server 11, and forms the downloaded data into a local website, according to its original structure, that is presented to local users through the clients 15. A client 15 can be any suitable data processing device, for example a personal computer, a mobile phone or a personal digital assistant.
Referring to Fig. 2, a functional block diagram of a preferred embodiment of the website rebuilding system 10 of Fig. 1 is shown. The website rebuilding system 10 comprises a download module 200, a non-webpage processing module 210, a filtering module 220, a judgment module 230, a hierarchical data processing module 240, a webpage text processing module 250 and a reconstruction module 260.
The download module 200 downloads the data of the target website 14 from the remote server 12. In this embodiment, the download module 200 downloads the data of the target website 14 at a preset frequency, for example once a week. The download module 200 first downloads the home page of the target website 14. To download data, the download module 200 sends a download request to the remote server 12 according to a network protocol, and the remote server 12 responds to the request and returns the corresponding data.
The non-webpage processing module 210 stores the downloaded data in the database 17 when that data is non-webpage data. In this embodiment, the non-webpage processing module 210 first checks whether the database 17 already holds corresponding data. If it does not, the downloaded non-webpage data is stored directly. If it does, the non-webpage data is stored only when it has changed.
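The store-only-when-changed rule can be sketched as follows. This is a minimal illustration, not the patent's implementation: the `ContentStore` class, its in-memory dictionaries standing in for database 17, and the use of a SHA-256 digest to detect changes are all assumptions, since the text does not say how a change is detected.

```python
import hashlib


class ContentStore:
    """Hypothetical in-memory stand-in for database 17."""

    def __init__(self):
        self._digests = {}  # URL -> digest of the copy currently stored
        self._blobs = {}    # URL -> stored bytes

    def store_if_changed(self, url: str, data: bytes) -> bool:
        """Store data under url unless an identical copy is already held.

        Returns True when something was written, False when the stored
        copy was already up to date.
        """
        digest = hashlib.sha256(data).hexdigest()
        if self._digests.get(url) == digest:
            return False  # corresponding data exists and has not changed
        self._digests[url] = digest
        self._blobs[url] = data
        return True

    def get(self, url: str) -> bytes:
        return self._blobs[url]
```

The same check would apply to hierarchical data and webpage text, which the embodiment says are stored under the same rule.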
The filtering module 220 filters the downloaded data when that data is a webpage, removing invalid data from the page such as headers, footers, advertisements and links that point to other websites. In this embodiment, the filtering module 220 searches for invalid data by keyword; for example, advertisements are located with the keyword "broadcast". The filtering module 220 also performs error correction on the webpage, for example by repairing incomplete links.
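A rough sketch of such a filter is given below. It rests on several assumptions that go beyond the text: invalid data is dropped line by line whenever a keyword appears, an "incomplete link" is taken to mean a relative URL, and links to other websites are reduced to their anchor text. The function name `filter_page` and the `ANCHOR` pattern are hypothetical.

```python
import re
from urllib.parse import urljoin, urlparse

# Matches a whole anchor element whose first attribute is a quoted href.
ANCHOR = re.compile(r'<a\s+href="([^"]*)"([^>]*)>(.*?)</a>',
                    re.IGNORECASE | re.DOTALL)


def filter_page(html: str, page_url: str, keywords=("broadcast",)) -> str:
    """Drop keyword-marked lines, remove off-site links, complete relative links."""
    site = urlparse(page_url).netloc

    # Keyword filtering: discard every line that mentions a keyword.
    kept = [line for line in html.splitlines()
            if not any(word in line.lower() for word in keywords)]

    def repair(match):
        absolute = urljoin(page_url, match.group(1))
        if urlparse(absolute).netloc != site:
            return match.group(3)  # link to another website: keep only its text
        # Link within the target website: rewrite it as a complete URL.
        return f'<a href="{absolute}"{match.group(2)}>{match.group(3)}</a>'

    return ANCHOR.sub(repair, "\n".join(kept))
```

Line-based keyword matching is deliberately crude; a production filter would more likely work on parsed elements so that a keyword in ordinary body text does not remove it.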
The judgment module 230 determines whether the filtered webpage contains links pointing to other data of the target website 14, hierarchical data, or webpage text. Hierarchical data is data with a hierarchical relationship, for example data with a tree structure. In this embodiment, the judgment module 230 first determines whether the filtered webpage is an abnormal webpage; if so, no further processing is performed on it. The judgment module 230 uses regular expressions to make these determinations. For example, it uses the regular expression <a href="${url}"> to determine whether the filtered webpage contains a link pointing to other data of the target website 14, and the regular expression (?<li><li><a[^>]*>[^<>]*</a>[^<>\n]*)(?<n>\n\s*)(?=<li>|</ul>) to determine whether it contains hierarchical data.
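The three checks can be illustrated with Python's `re` module. The patterns below are simplified stand-ins chosen for the sketch, not the expressions quoted in the embodiment, and the function name `classify` and the shape of its return value are assumptions.

```python
import re
from urllib.parse import urljoin, urlparse

LINK_RE = re.compile(r'<a\s+href="([^"]+)"', re.IGNORECASE)
# Treats any <ul> that contains an <li> as hierarchical data (an assumption).
LIST_RE = re.compile(r'<ul[^>]*>.*?<li', re.IGNORECASE | re.DOTALL)
TAG_RE = re.compile(r'<[^>]+>')


def classify(html: str, page_url: str) -> dict:
    """Report which of the three kinds of content a filtered page contains."""
    site = urlparse(page_url).netloc
    targets = [urljoin(page_url, href) for href in LINK_RE.findall(html)]
    return {
        # Links that stay on the target website.
        "links": [url for url in targets if urlparse(url).netloc == site],
        "has_hierarchy": bool(LIST_RE.search(html)),
        # Any non-blank text left once the tags are stripped.
        "has_text": bool(TAG_RE.sub(" ", html).strip()),
    }
```

A page for which all three results are empty or false is a natural candidate for the abnormal-webpage case, although the text does not define abnormality in these terms.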
The hierarchical data processing module 240 processes the hierarchical data contained in the filtered webpage. First, it parses the hierarchical relationship of the hierarchical data and obtains its leaf nodes; in this embodiment, the parsing is done with regular expressions. Second, it stores the hierarchical data in the database 17 according to that hierarchical relationship. As with non-webpage data, the module first checks whether the database 17 already holds corresponding data. If it does not, the hierarchical data is stored directly. If it does, the hierarchical data is stored only when its hierarchical relationship has changed. Finally, the module determines whether the leaf nodes of the hierarchical data contain links pointing to other data of the target website 14.
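To make the idea of parsing a hierarchy and collecting its leaf nodes concrete, the sketch below walks nested `<ul>`/`<li>` lists. Note the substitution: it uses the standard-library `HTMLParser` instead of the regular expressions the embodiment describes, because nested lists are awkward to match with a single pattern. The node layout (`label`, `href`, `children`) and all names are hypothetical.

```python
from html.parser import HTMLParser


class ListTreeParser(HTMLParser):
    """Builds a tree of dict nodes from nested <li> elements."""

    def __init__(self):
        super().__init__()
        self.root = {"label": "", "href": None, "children": []}
        self._stack = [self.root]  # path from the root to the open <li>

    def handle_starttag(self, tag, attrs):
        if tag == "li":
            node = {"label": "", "href": None, "children": []}
            self._stack[-1]["children"].append(node)
            self._stack.append(node)
        elif tag == "a" and len(self._stack) > 1:
            self._stack[-1]["href"] = dict(attrs).get("href")

    def handle_endtag(self, tag):
        if tag == "li" and len(self._stack) > 1:
            self._stack.pop()

    def handle_data(self, data):
        text = data.strip()
        if text and len(self._stack) > 1:
            node = self._stack[-1]
            node["label"] = (node["label"] + " " + text).strip()


def parse_hierarchy(html: str) -> dict:
    parser = ListTreeParser()
    parser.feed(html)
    parser.close()
    return parser.root


def leaf_nodes(node: dict) -> list:
    """Return the nodes below `node` that have no children, in document order."""
    found = []
    for child in node["children"]:
        found.extend(leaf_nodes(child) if child["children"] else [child])
    return found
```

The `href` values of the returned leaves are the links that would be handed back to the download module.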
The webpage text processing module 250 stores the webpage text contained in the filtered webpage in the database 17. In this embodiment, as with non-webpage data and hierarchical data, the webpage text processing module 250 first checks whether the database 17 already holds corresponding data. If it does not, the webpage text is stored directly. If it does, the webpage text is stored only when it has changed.
The reconstruction module 260 forms the non-webpage data, hierarchical data and webpage text stored in the database 17 into a local website according to the original structure, which is presented to local users through the clients 15.
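One possible reading of "according to the original structure" is that each stored item keeps the path it had on the target website. The sketch below maps stored URLs to local relative paths under that assumption; the function names, the `site` root directory and the `index.html` default for directory URLs are all hypothetical and not taken from the patent.

```python
from pathlib import PurePosixPath
from urllib.parse import urlparse


def local_path(url: str, root: str = "site") -> str:
    """Map a URL of the target website to a path under a local root directory."""
    path = urlparse(url).path
    if path == "" or path.endswith("/"):
        path += "index.html"  # assumed default document for directory URLs
    return str(PurePosixPath(root) / path.lstrip("/"))


def plan_layout(urls, root: str = "site") -> dict:
    """Return a URL -> local path mapping that mirrors the remote layout."""
    return {url: local_path(url, root) for url in urls}
```

Writing the stored bytes to these paths, and serving the root directory over the internal network, would then give clients a local copy with the same navigation structure.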
Referring to Fig. 3, a flowchart of a preferred embodiment of the website rebuilding method of the present invention is shown.
In step S301, the download module 200 downloads the data of the target website 14 from the remote server 12. In this embodiment, the download module 200 downloads the data of the target website 14 at a preset frequency, for example once a week. As described above, the data of the target website 14 includes webpages (for example, webpages in HTML format) and non-webpage data (for example, files in exe, doc or zip format). The download module 200 first downloads the home page of the target website 14. To download data, the download module 200 sends a download request to the remote server 12 according to a network protocol, and the remote server 12 responds to the request and returns the corresponding data.
In step S302, the judgment module 230 determines whether the downloaded data is a webpage.
If the downloaded data is non-webpage data, then in step S303 the non-webpage processing module 210 stores the non-webpage data in the database 17. In this embodiment, the non-webpage processing module 210 first checks whether the database 17 already holds corresponding data. If it does not, the non-webpage data is stored directly. If it does, the non-webpage data is stored only when it has changed.
If the downloaded data is a webpage, then in step S304 the filtering module 220 filters the webpage to remove invalid data from it, such as headers, footers, advertisements and links that point to other websites. In this embodiment, the filtering module 220 searches for invalid data by keyword; for example, advertisements are located with the keyword "broadcast". The filtering module 220 also performs error correction on the webpage, for example by repairing incomplete links.
In step S305, the judgment module 230 determines whether the filtered webpage is an abnormal webpage. If it is, step S313 is executed. Abnormal webpages include blank pages and error pages.
If the filtered webpage is not an abnormal webpage, then in step S306 the judgment module 230 determines whether the filtered webpage contains a link pointing to other data of the target website 14. In this embodiment, the judgment module 230 makes this determination with a regular expression, for example <a href="${url}">. If the filtered webpage contains a link pointing to other data of the target website 14, the flow returns to step S301 and the download module 200 downloads the data the link points to.
If the filtered webpage does not contain a link pointing to other data of the target website 14, then in step S307 the judgment module 230 determines whether the filtered webpage contains hierarchical data. Hierarchical data is data with a hierarchical relationship, for example data with a tree structure. In this embodiment, the judgment module 230 makes this determination with a regular expression, for example (?<li><li><a[^>]*>[^<>]*</a>[^<>\n]*)(?<n>\n\s*)(?=<li>|</ul>).
If the filtered webpage contains hierarchical data, then in step S308 the hierarchical data processing module 240 parses the hierarchical relationship of the hierarchical data and obtains its leaf nodes. In this embodiment, the parsing is done with regular expressions.
In step S309, the hierarchical data processing module 240 stores the hierarchical data in the database 17 according to its hierarchical relationship. In this embodiment, the hierarchical data processing module 240 first checks whether the database 17 already holds corresponding data. If it does not, the hierarchical data is stored directly. If it does, the hierarchical data is stored only when its hierarchical relationship has changed.
In step S310, the hierarchical data processing module 240 determines whether the leaf nodes of the hierarchical data contain links pointing to other data of the target website 14. If they do, the flow returns to step S301 and the download module 200 downloads the data those links point to. If they do not, step S313 is executed.
In step S311, the judgment module 230 determines whether the filtered webpage contains webpage text. If it does not, step S313 is executed.
If the filtered webpage contains webpage text, then in step S312 the webpage text processing module 250 stores the webpage text in the database 17. In this embodiment, the webpage text processing module 250 first checks whether the database 17 already holds corresponding data. If it does not, the webpage text is stored directly. If it does, the webpage text is stored only when it has changed.
In step S313, the reconstruction module 260 forms the non-webpage data, hierarchical data and webpage text stored in the database 17 into a local website according to the original structure, which is presented to local users through the clients 15.
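The overall loop of steps S301 to S313 can be condensed into a short sketch. Several simplifications are assumptions made here rather than features of the patent: downloading is delegated to a caller-supplied `fetch(url)` function returning a content type and body, a queue and a visited set replace the flowchart's "return to step S301", only blank pages are treated as abnormal, and filtering and hierarchy handling are left out.

```python
import re
from collections import deque
from urllib.parse import urljoin, urlparse

LINK_RE = re.compile(r'<a\s+href="([^"]+)"', re.IGNORECASE)


def rebuild_site(home_url, fetch):
    """Download a site starting from its home page and return {url: bytes}.

    `fetch` is a hypothetical callable: fetch(url) -> (content_type, body_bytes).
    """
    site = urlparse(home_url).netloc
    stored = {}
    seen = {home_url}
    queue = deque([home_url])            # S301 starts with the home page
    while queue:
        url = queue.popleft()
        content_type, body = fetch(url)
        if not content_type.startswith("text/html"):
            stored[url] = body           # S302/S303: non-webpage data is stored as-is
            continue
        html = body.decode("utf-8", errors="replace")
        if not html.strip():             # S305: skip blank (abnormal) pages
            continue
        stored[url] = body               # stands in for the storage steps S309 and S312
        for href in LINK_RE.findall(html):   # S306: links within the target website
            target = urljoin(url, href)
            if urlparse(target).netloc == site and target not in seen:
                seen.add(target)
                queue.append(target)
    return stored                        # input to the reconstruction step S313
```

Passing `fetch` in as a parameter keeps the sketch independent of any particular HTTP client and lets the loop be exercised against an in-memory dictionary of pages.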
Claims (10)
1. A website rebuilding system running on a local server, the local server being communicatively connected to a remote server that hosts a target website, characterized in that the system comprises:
a download module for downloading data of the target website from the remote server;
a non-webpage processing module for storing the downloaded data when it is non-webpage data;
a filtering module for filtering the downloaded data when it is a webpage, so as to remove invalid data from the webpage;
a judgment module for determining whether the filtered webpage contains links pointing to other data of the target website, hierarchical data and webpage text, wherein for each link in the filtered webpage that points to other data of the target website, the download module downloads the data the link points to from the remote server;
a hierarchical data processing module for, with respect to hierarchical data in the filtered webpage, parsing the hierarchical relationship of the hierarchical data, obtaining its leaf nodes, storing the hierarchical data according to the hierarchical relationship, and determining whether the leaf nodes contain links pointing to other data of the target website, wherein for each such link in a leaf node, the download module downloads the data the link points to from the remote server;
a webpage text processing module for storing the webpage text of the filtered webpage; and
a reconstruction module for forming the stored non-webpage data, hierarchical data and webpage text into a local website according to the original structure.
2. The website rebuilding system of claim 1, characterized in that the filtering module filters the webpage by keyword.
3. The website rebuilding system of claim 1, characterized in that the filtering module also performs error correction on the webpage.
4. The website rebuilding system of claim 1, characterized in that the judgment module first determines whether the filtered webpage is an abnormal webpage and, if so, performs no further processing on the filtered webpage.
5. The website rebuilding system of claim 1, characterized in that the download module downloads the data of the target website at a preset frequency.
6. A website rebuilding method, characterized in that the method comprises:
a download step: downloading data of the target website;
a non-webpage processing step: when the downloaded data is non-webpage data, storing the non-webpage data;
a filtering step: when the downloaded data is a webpage, filtering the webpage to remove invalid data from it;
a judgment step: determining whether the filtered webpage contains links pointing to other data of the target website, hierarchical data and webpage text;
a link processing step: for each link in the filtered webpage that points to other data of the target website, returning to the download step and downloading the data the link points to from the remote server;
a hierarchical data processing step: with respect to hierarchical data in the filtered webpage, parsing the hierarchical relationship of the hierarchical data, obtaining its leaf nodes, storing the hierarchical data according to the hierarchical relationship, determining whether the leaf nodes contain links pointing to other data of the target website, and, for each such link in a leaf node, returning to the download step and downloading the data the link points to from the remote server;
a webpage text processing step: storing the webpage text of the filtered webpage; and
a reconstruction step: forming the stored non-webpage data, hierarchical data and webpage text into a local website according to the original structure.
7. The website rebuilding method of claim 6, characterized in that the filtering step filters the webpage by keyword.
8. The website rebuilding method of claim 6, characterized in that the filtering step further comprises performing error correction on the webpage.
9. The website rebuilding method of claim 6, characterized in that the method further comprises, before the judgment step, determining whether the filtered webpage is an abnormal webpage and, if so, performing no further processing on the filtered webpage.
10. The website rebuilding method of claim 6, characterized in that the download step downloads the data of the target website at a preset frequency.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN2009103118771A CN102103622A (en) | 2009-12-21 | 2009-12-21 | Web site rebuilding system and method |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN2009103118771A CN102103622A (en) | 2009-12-21 | 2009-12-21 | Web site rebuilding system and method |
Publications (1)
Publication Number | Publication Date |
---|---|
CN102103622A (en) | 2011-06-22 |
Family
ID=44156398
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN2009103118771A Pending CN102103622A (en) | 2009-12-21 | 2009-12-21 | Web site rebuilding system and method |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN102103622A (en) |
Cited By (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN106802893A (en) * | 2015-11-26 | 2017-06-06 | 财团法人资讯工业策进会 | Website method for simplifying and the website simplification device using it |
CN110209971A (en) * | 2019-05-15 | 2019-09-06 | 朱容宇 | A kind of method and system of website recombination reduction |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
C06 | Publication | ||
PB01 | Publication | ||
C10 | Entry into substantive examination | ||
SE01 | Entry into force of request for substantive examination | ||
C12 | Rejection of a patent application after its publication | ||
RJ01 | Rejection of invention patent application after publication | Application publication date: 20110622 |