CN102103622A - Web site rebuilding system and method - Google Patents

Info

Publication number
CN102103622A
Authority
CN
China
Prior art keywords
data
webpage
website
hierarchical
module
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN2009103118771A
Other languages
Chinese (zh)
Inventor
李忠一
黄新宇
吴吕红
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Hongfujin Precision Industry Shenzhen Co Ltd
Hon Hai Precision Industry Co Ltd
Original Assignee
Hongfujin Precision Industry Shenzhen Co Ltd
Hon Hai Precision Industry Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Hongfujin Precision Industry Shenzhen Co Ltd, Hon Hai Precision Industry Co Ltd filed Critical Hongfujin Precision Industry Shenzhen Co Ltd
Priority to CN2009103118771A priority Critical patent/CN102103622A/en
Publication of CN102103622A publication Critical patent/CN102103622A/en
Pending legal-status Critical Current

Abstract

The invention discloses a web site rebuilding system. The system comprises a downloading module, a non-web page processing module, a filter module, a judgment module, a hierarchical data processing module, a web page text processing module and a reconstruction module, wherein the downloading module is used for downloading data of a target web site; the non-web page processing module is used for storing non-web page data when the downloaded data are non-web page data; the filter module is used for filtering web pages when the downloaded data are web pages; the judgment module is used for judging whether the filtered web pages contain links pointing to other data of the target web site, hierarchical data and web page texts; the hierarchical data processing module is used for resolving the hierarchical relationship of the hierarchical data, storing the hierarchical data according to the hierarchical relationship, and judging whether the leaf nodes of the hierarchical data contain links pointing to other data of the target web site; the web page text processing module is used for storing the web page texts; and the reconstruction module is used for forming a local web site from the stored non-web page data, hierarchical data and web page texts according to the original structure. The invention also provides a web site rebuilding method. The system and the method can rebuild a remote web site as a local site.

Description

Website rebuilding system and method
Technical field
The present invention relates to a system and method for downloading website data, and in particular to a website rebuilding system and method.
Background
With the growing popularity and rapid development of the Internet, people depend on the network more and more. For websites closely related to their business, users wish to set up a corresponding local website that preserves the site's information so that it can be used when needed.
Summary of the invention
In view of the above, it is necessary to provide a website rebuilding system that can rebuild a remote website locally.
It is also necessary to provide a website rebuilding method that can rebuild a remote website locally.
A website rebuilding system runs in a local server. The local server is communicatively connected to a remote server on which a target website is set up. The system comprises: a downloading module for downloading data of the target website from the remote server; a non-web-page processing module for storing the downloaded data when the downloaded data are non-web-page data; a filtering module for filtering the web page when the downloaded data are a web page, so as to remove invalid data from the web page; a judgment module for judging whether the filtered web page contains links pointing to other data of the target website, hierarchical data, and web page text, wherein for each link in the filtered web page that points to other data of the target website, the downloading module downloads the data the link points to from the remote server; a hierarchical data processing module for parsing the hierarchical relationship of the hierarchical data in the filtered web page to obtain the leaf nodes of the hierarchical data, storing the hierarchical data according to its hierarchical relationship, and judging whether the leaf nodes contain links pointing to other data of the target website, wherein for each such link in a leaf node, the downloading module downloads the data the link points to from the remote server; a web page text processing module for storing the web page text of the filtered web page; and a reconstruction module for forming a local website from the stored non-web-page data, hierarchical data, and web page text according to the original structure.
A website rebuilding method comprises: a downloading step of downloading data of a target website; a non-web-page processing step of storing the downloaded data when the downloaded data are non-web-page data; a filtering step of filtering the web page when the downloaded data are a web page, so as to remove invalid data from the web page; a judging step of judging whether the filtered web page contains links pointing to other data of the target website, hierarchical data, and web page text; a link processing step of, for each link in the filtered web page that points to other data of the target website, returning to the downloading step and downloading the data the link points to from the remote server; a hierarchical data processing step of parsing the hierarchical relationship of the hierarchical data in the filtered web page to obtain the leaf nodes of the hierarchical data, storing the hierarchical data according to its hierarchical relationship, judging whether the leaf nodes contain links pointing to other data of the target website, and, for each such link in a leaf node, returning to the downloading step and downloading the data the link points to from the remote server; a web page text processing step of storing the web page text of the filtered web page; and a reconstruction step of forming a local website from the stored non-web-page data, hierarchical data, and web page text according to the original structure.
The present invention downloads the data of the target website, saves them locally according to their original structure, and forms a local website according to that structure.
Brief description of the drawings
Fig. 1 is a schematic diagram of the application environment of a preferred embodiment of the website rebuilding system of the present invention.
Fig. 2 is a functional block diagram of the website rebuilding system in Fig. 1.
Fig. 3 is a flowchart of a preferred embodiment of the website rebuilding method of the present invention.
Description of main reference numerals
Website rebuilding system 10
Local server 11
Remote server 12
External network 13
Target website 14
Client 15
Internal network 16
Database 17
Downloading module 200
Non-web-page processing module 210
Filtering module 220
Judgment module 230
Hierarchical data processing module 240
Web page text processing module 250
Reconstruction module 260
Detailed description
Fig. 1 is a schematic diagram of the application environment of a preferred embodiment of the website rebuilding system of the present invention. The website rebuilding system 10 runs in a local server 11. The local server 11 is communicatively connected to a remote server 12 through an external network 13 (for example, the Internet). A target website 14 is set up on the remote server 12, and the data of the target website 14 include web pages (for example, pages in HTML format) and non-web-page data (for example, files in exe, doc, or zip format). The local server 11 is also communicatively connected to a plurality of clients 15 through an internal network 16. The website rebuilding system 10 downloads the data of the target website 14 from the remote server 12, stores them according to their original structure in a database 17 connected to the local server 11, and forms a local website from the downloaded data according to that structure, which is presented to local users through the clients 15. A client 15 can be any suitable data processing device, such as a personal computer, a mobile phone, or a personal digital assistant.
Fig. 2 is a functional block diagram of a preferred embodiment of the website rebuilding system 10 in Fig. 1. The website rebuilding system 10 comprises a downloading module 200, a non-web-page processing module 210, a filtering module 220, a judgment module 230, a hierarchical data processing module 240, a web page text processing module 250, and a reconstruction module 260.
The downloading module 200 downloads the data of the target website 14 from the remote server 12. In this embodiment, the downloading module 200 downloads the data of the target website 14 at a preset frequency, for example once a week. In this embodiment, the downloading module 200 first downloads the home page of the target website 14. To download data, the downloading module 200 sends a download request to the remote server 12 according to a network protocol, and the remote server 12 responds to the request and returns the corresponding data.
The non-web-page processing module 210 stores the downloaded data in the database 17 when the downloaded data are non-web-page data. In this embodiment, the non-web-page processing module 210 first determines whether the database 17 already holds corresponding data. If it does not, the downloaded non-web-page data are stored directly. If it does, the downloaded non-web-page data are stored only when they have changed.
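The store-only-if-changed rule described above can be sketched as follows. This is a minimal illustration, not the embodiment's implementation: the class `ChangeAwareStore`, the in-memory dictionary standing in for database 17, and the use of a SHA-256 digest to detect changes are all assumptions introduced here.

```python
import hashlib


class ChangeAwareStore:
    """Hypothetical in-memory stand-in for database 17, keyed by URL."""

    def __init__(self):
        self._records = {}  # url -> (sha256 hex digest, payload bytes)

    def save_if_changed(self, url: str, payload: bytes) -> bool:
        """Store payload unless an identical copy is already stored.

        Returns True when the payload was written, False when it was skipped.
        """
        digest = hashlib.sha256(payload).hexdigest()
        existing = self._records.get(url)
        if existing is not None and existing[0] == digest:
            return False  # corresponding data exist and have not changed
        self._records[url] = (digest, payload)
        return True

    def get(self, url: str):
        """Return the stored payload for url, or None if nothing is stored."""
        record = self._records.get(url)
        return None if record is None else record[1]
```

Comparing digests rather than full payloads keeps the check cheap for large files; a real deployment would presumably persist the digest alongside the data in the database.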
The filtering module 220 filters the web page when the downloaded data are a web page, so as to remove invalid data from the page, such as the page header, the page footer, advertisements, and links pointing to other websites. In this embodiment, the filtering module 220 searches for invalid data in the web page using keywords. For example, advertisements in the web page are located with the keyword "broadcast". In this embodiment, the filtering module 220 also performs error correction on the web page, for example by repairing incomplete links in the page.
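A possible shape for such a filter is sketched below. The function `filter_page`, the assumption that invalid blocks are non-nested `<div>` elements whose opening tag mentions a keyword, and the choice to repair incomplete links by resolving them against the page URL are all illustrative assumptions; the embodiment does not specify these details.

```python
import re
from urllib.parse import urljoin, urlparse

# Captures the part before the href value, the value itself, and the closing quote.
HREF_RE = re.compile(r'(<a\b[^>]*?\bhref=")([^"]*)(")', re.IGNORECASE)
# Captures the href value and the inner content of a whole anchor element.
ANCHOR_RE = re.compile(
    r'<a\b[^>]*?\bhref="([^"]*)"[^>]*>(.*?)</a>', re.IGNORECASE | re.DOTALL
)


def filter_page(html: str, page_url: str, keywords=("broadcast",)) -> str:
    """Remove keyword-marked blocks, complete relative links, drop off-site links."""
    # 1. Drop non-nested <div> blocks whose opening tag mentions a keyword.
    for keyword in keywords:
        block = re.compile(
            r"<div\b[^>]*" + re.escape(keyword) + r"[^>]*>.*?</div>",
            re.IGNORECASE | re.DOTALL,
        )
        html = block.sub("", html)

    # 2. Error correction: resolve incomplete (relative) links against the page URL.
    html = HREF_RE.sub(
        lambda m: m.group(1) + urljoin(page_url, m.group(2)) + m.group(3), html
    )

    # 3. Replace anchors that point to other websites with their inner content.
    site = urlparse(page_url).netloc

    def unwrap(match):
        same_site = urlparse(match.group(1)).netloc == site
        return match.group(0) if same_site else match.group(2)

    return ANCHOR_RE.sub(unwrap, html)
```

Regular expressions are used here only to stay close to the keyword-based approach of the embodiment; they are fragile on nested or malformed markup, where a real HTML parser would be a safer choice.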
The judgment module 230 judges whether the filtered web page contains links pointing to other data of the target website 14, hierarchical data, or web page text. Hierarchical data are data with a hierarchical relationship, for example data with a tree structure. In this embodiment, the judgment module 230 first judges whether the filtered web page is an abnormal web page; if it is, no further processing is performed on it. In this embodiment, the judgment module 230 uses regular expressions to judge whether the filtered web page contains links pointing to other data of the target website 14, hierarchical data, or web page text. For example, the judgment module 230 uses the regular expression <a href="${url}"> to judge whether the filtered web page contains a link pointing to other data of the target website 14. As another example, the judgment module 230 uses the regular expression (?<li><li><a[^>]*>[^<>]*</a>[^<>\n]*)(?<n>\n\s*)(?=<li>|</ul>) to judge whether the filtered web page contains hierarchical data.
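The three judgments can be sketched with Python's `re` module as follows. The function `classify`, its return format, and the tag-stripping test for web page text are assumptions made for illustration. The list-item pattern is a cleaned-up adaptation of the expression quoted in the embodiment, written with Python's `(?P<name>...)` named-group syntax.

```python
import re
from urllib.parse import urljoin, urlparse

# Matches the href value of an anchor whose first attribute is href.
LINK_RE = re.compile(r'<a\s+href="([^"]*)"', re.IGNORECASE)

# A list item holding a single link, followed by a line break and then
# either the next item or the end of the list.
LIST_ITEM_RE = re.compile(
    r"(?P<li><li><a[^>]*>[^<>]*</a>[^<>\n]*)(?P<n>\n\s*)(?=<li>|</ul>)"
)


def classify(html: str, page_url: str) -> dict:
    """Report in-site links, presence of hierarchical data, and presence of text."""
    site = urlparse(page_url).netloc
    absolute = [urljoin(page_url, href) for href in LINK_RE.findall(html)]
    in_site = [url for url in absolute if urlparse(url).netloc == site]
    text = re.sub(r"<[^>]+>", " ", html)  # crude tag stripping
    return {
        "links": in_site,
        "hierarchical": LIST_ITEM_RE.search(html) is not None,
        "text": bool(text.strip()),
    }
```

Links are compared by host name after resolving them against the page URL, which is one simple way to decide whether a link points to "other data of the target website".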
The hierarchical data processing module 240 processes the hierarchical data contained in the filtered web page. First, the hierarchical data processing module 240 parses the hierarchical relationship of the hierarchical data and obtains its leaf nodes. In this embodiment, the hierarchical data processing module 240 parses the hierarchical relationship using regular expressions. Second, the hierarchical data processing module 240 stores the hierarchical data in the database 17 according to its hierarchical relationship. In this embodiment, as with non-web-page data, the hierarchical data processing module 240 first determines whether the database 17 already holds corresponding data. If it does not, the hierarchical data are stored directly. If it does, the hierarchical data are stored only when their hierarchical relationship has changed. Finally, the hierarchical data processing module 240 judges whether the leaf nodes of the hierarchical data contain links pointing to other data of the target website 14.
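To illustrate what parsing the hierarchical relationship and obtaining the leaf nodes might look like, the sketch below builds a tree from nested `<ul>`/`<li>` markup. It swaps in the standard-library `html.parser` for the regular expressions used in the embodiment, since nested structures are awkward to handle with regular expressions alone. The names `ListTreeParser`, `parse_hierarchy`, and `leaf_nodes`, and the dictionary layout of a node, are hypothetical.

```python
from html.parser import HTMLParser


def _new_node():
    return {"label": "", "href": None, "children": []}


class ListTreeParser(HTMLParser):
    """Builds a tree of nodes from nested <ul>/<li> markup."""

    def __init__(self):
        super().__init__()
        self.root = _new_node()
        self._parents = [self.root]  # stack; the last entry is the current parent
        self._current = None         # the <li> node currently receiving text

    def handle_starttag(self, tag, attrs):
        if tag == "li":
            node = _new_node()
            self._parents[-1]["children"].append(node)
            self._current = node
        elif tag == "ul" and self._current is not None:
            # A list nested inside an item: that item becomes the parent.
            self._parents.append(self._current)
            self._current = None
        elif tag == "a" and self._current is not None:
            self._current["href"] = dict(attrs).get("href")

    def handle_endtag(self, tag):
        if tag == "ul" and len(self._parents) > 1:
            self._parents.pop()
            self._current = None
        elif tag == "li":
            self._current = None

    def handle_data(self, data):
        if self._current is not None:
            self._current["label"] += data.strip()


def parse_hierarchy(html: str) -> dict:
    """Return the root node of the tree described by the list markup."""
    parser = ListTreeParser()
    parser.feed(html)
    parser.close()
    return parser.root


def leaf_nodes(node: dict) -> list:
    """Collect, in document order, the descendants that have no children."""
    leaves = []
    for child in node["children"]:
        leaves.extend(leaf_nodes(child) if child["children"] else [child])
    return leaves
```

For a navigation list with a "Docs" item containing two linked sub-items and a sibling linked item, `leaf_nodes(parse_hierarchy(html))` yields the three linked items, whose `href` values are the candidates to check for links into the target website.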
The web page text processing module 250 stores the web page text contained in the filtered web page in the database 17. In this embodiment, as with non-web-page data and hierarchical data, the web page text processing module 250 first determines whether the database 17 already holds corresponding data. If it does not, the web page text is stored directly. If it does, the web page text is stored only when it has changed.
The reconstruction module 260 forms a local website from the non-web-page data, hierarchical data, and web page text stored in the database 17 according to their original structure, and presents it to local users through the clients 15.
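One way to preserve the original structure is to mirror each stored URL path as a path under a local directory. The sketch below assumes the stored data are available as a mapping from URL to bytes and that the local website is served from a directory tree; both assumptions, along with the names `local_path` and `rebuild`, are introduced here for illustration only.

```python
import os
import tempfile
from urllib.parse import urlparse


def local_path(url: str, root_dir: str) -> str:
    """Map a URL of the target website to a file path under root_dir."""
    path = urlparse(url).path or "/"
    if path.endswith("/"):
        path += "index.html"  # directory URLs become index pages
    # Drop empty and relative segments so the result stays inside root_dir.
    parts = [p for p in path.split("/") if p not in ("", ".", "..")]
    return os.path.join(root_dir, *parts)


def rebuild(stored: dict, root_dir: str) -> list:
    """Write every stored item to its mirrored location; return the file paths."""
    written = []
    for url, payload in stored.items():
        target = local_path(url, root_dir)
        os.makedirs(os.path.dirname(target), exist_ok=True)
        with open(target, "wb") as handle:
            handle.write(payload)
        written.append(target)
    return written


if __name__ == "__main__":
    demo_root = tempfile.mkdtemp()
    print(rebuild({"http://example.com/docs/a.zip": b"PK"}, demo_root))
```

Serving the resulting directory from the local server over the internal network would then give clients the same relative link structure as the remote site.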
Fig. 3 is a flowchart of a preferred embodiment of the website rebuilding method of the present invention.
In step S301, the downloading module 200 downloads the data of the target website 14 from the remote server 12. In this embodiment, the downloading module 200 downloads the data of the target website 14 at a preset frequency, for example once a week. As mentioned above, the data of the target website 14 include web pages (for example, pages in HTML format) and non-web-page data (for example, files in exe, doc, or zip format). In this embodiment, the downloading module 200 first downloads the home page of the target website 14. To download data, the downloading module 200 sends a download request to the remote server 12 according to a network protocol, and the remote server 12 responds to the request and returns the corresponding data.
In step S302, the judgment module 230 judges whether the downloaded data are a web page.
If the downloaded data are non-web-page data, then in step S303 the non-web-page processing module 210 stores the non-web-page data in the database 17. In this embodiment, the non-web-page processing module 210 first determines whether the database 17 already holds corresponding data. If it does not, the non-web-page data are stored directly. If it does, the non-web-page data are stored only when they have changed.
If the downloaded data are a web page, then in step S304 the filtering module 220 filters the web page to remove invalid data from it, such as the page header, the page footer, advertisements, and links pointing to other websites. In this embodiment, the filtering module 220 searches for invalid data in the web page using keywords. For example, advertisements in the web page are located with the keyword "broadcast". In this embodiment, the filtering module 220 also performs error correction on the web page, for example by repairing incomplete links in the page.
In step S305, the judgment module 230 judges whether the filtered web page is an abnormal web page. If it is, step S313 is executed. Abnormal web pages include blank pages and erroneous pages.
If the filtered web page is not an abnormal web page, then in step S306 the judgment module 230 judges whether the filtered web page contains links pointing to other data of the target website 14. In this embodiment, the judgment module 230 makes this judgment using a regular expression, for example <a href="${url}">. If the filtered web page contains a link pointing to other data of the target website 14, the flow returns to step S301 and the downloading module 200 downloads the data the link points to.
If the filtered web page does not contain links pointing to other data of the target website 14, then in step S307 the judgment module 230 judges whether the filtered web page contains hierarchical data. Hierarchical data are data with a hierarchical relationship, for example data with a tree structure. In this embodiment, the judgment module 230 makes this judgment using a regular expression, for example (?<li><li><a[^>]*>[^<>]*</a>[^<>\n]*)(?<n>\n\s*)(?=<li>|</ul>).
If the filtered web page contains hierarchical data, then in step S308 the hierarchical data processing module 240 parses the hierarchical relationship of the hierarchical data and obtains its leaf nodes. In this embodiment, the hierarchical data processing module 240 parses the hierarchical relationship using regular expressions.
In step S309, the hierarchical data processing module 240 stores the hierarchical data in the database 17 according to its hierarchical relationship. In this embodiment, the hierarchical data processing module 240 first determines whether the database 17 already holds corresponding data. If it does not, the hierarchical data are stored directly. If it does, the hierarchical data are stored only when their hierarchical relationship has changed.
In step S310, the hierarchical data processing module 240 judges whether the leaf nodes of the hierarchical data contain links pointing to other data of the target website 14. If they do, the flow returns to step S301 and the downloading module 200 downloads the data each such link points to. If they do not, step S313 is executed.
In step S311, the judgment module 230 judges whether the filtered web page contains web page text. If it does not, step S313 is executed.
If the filtered web page contains web page text, then in step S312 the web page text processing module 250 stores the web page text in the database 17. In this embodiment, the web page text processing module 250 first determines whether the database 17 already holds corresponding data. If it does not, the web page text is stored directly. If it does, the web page text is stored only when it has changed.
In step S313, the reconstruction module 260 forms a local website from the non-web-page data, hierarchical data, and web page text stored in the database 17 according to their original structure, and presents it to local users through the clients 15.
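The overall flow of steps S301 to S312 can be condensed into a short sketch. Several simplifications are assumptions made here rather than features of the embodiment: a work queue replaces the repeated return to step S301, the filtering and hierarchical data steps are omitted, whole pages are stored instead of extracted text, and the downloader is injected as a hypothetical `fetch(url)` callable returning a content type and a body so that the sketch runs without network access.

```python
import re
from collections import deque
from urllib.parse import urljoin, urlparse

LINK_RE = re.compile(r'<a\b[^>]*\bhref="([^"]*)"', re.IGNORECASE)


def crawl(home_url: str, fetch, max_items: int = 1000) -> dict:
    """Download a site starting from its home page; return {url: bytes}.

    fetch(url) is a caller-supplied function returning (content_type, body).
    """
    site = urlparse(home_url).netloc
    queue = deque([home_url])  # the home page is downloaded first
    seen = {home_url}
    stored = {}
    while queue and len(stored) < max_items:
        url = queue.popleft()
        content_type, body = fetch(url)                # S301: download
        if not content_type.startswith("text/html"):  # S302: web page or not
            stored[url] = body                         # S303: store non-web-page data
            continue
        html = body.decode("utf-8", errors="replace")
        if not html.strip():                           # S305: blank page is abnormal
            continue
        stored[url] = body                             # S312, simplified: store the page
        for href in LINK_RE.findall(html):             # S306: links into the site
            link = urljoin(url, href)
            if urlparse(link).netloc == site and link not in seen:
                seen.add(link)
                queue.append(link)                     # later handled as S301
    return stored
```

The `seen` set prevents pages that link to each other from being downloaded repeatedly, a safeguard the flowchart does not spell out but which any loop of this kind needs in order to terminate.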

Claims (10)

1. A website rebuilding system running in a local server, the local server being communicatively connected to a remote server on which a target website is set up, characterized in that the system comprises:
a downloading module for downloading data of the target website from the remote server;
a non-web-page processing module for storing the downloaded data when the downloaded data are non-web-page data;
a filtering module for filtering the web page when the downloaded data are a web page, so as to remove invalid data from the web page;
a judgment module for judging whether the filtered web page contains links pointing to other data of the target website, hierarchical data, and web page text, wherein for each link in the filtered web page that points to other data of the target website, the downloading module downloads the data the link points to from the remote server;
a hierarchical data processing module for parsing the hierarchical relationship of the hierarchical data in the filtered web page to obtain the leaf nodes of the hierarchical data, storing the hierarchical data according to its hierarchical relationship, and judging whether the leaf nodes contain links pointing to other data of the target website, wherein for each such link in a leaf node, the downloading module downloads the data the link points to from the remote server;
a web page text processing module for storing the web page text of the filtered web page; and
a reconstruction module for forming a local website from the stored non-web-page data, hierarchical data, and web page text according to the original structure.
2. The website rebuilding system as claimed in claim 1, characterized in that the filtering module filters the web page using keywords.
3. The website rebuilding system as claimed in claim 1, characterized in that the filtering module also performs error correction on the web page.
4. The website rebuilding system as claimed in claim 1, characterized in that the judgment module first judges whether the filtered web page is an abnormal web page and, if it is, performs no further processing on the filtered web page.
5. The website rebuilding system as claimed in claim 1, characterized in that the downloading module downloads the data of the target website at a preset frequency.
6. A website rebuilding method, characterized in that the method comprises:
a downloading step: downloading data of a target website;
a non-web-page processing step: when the downloaded data are non-web-page data, storing the non-web-page data;
a filtering step: when the downloaded data are a web page, filtering the web page to remove invalid data from it;
a judging step: judging whether the filtered web page contains links pointing to other data of the target website, hierarchical data, and web page text;
a link processing step: for each link in the filtered web page that points to other data of the target website, returning to the downloading step and downloading the data the link points to from the remote server;
a hierarchical data processing step: for the hierarchical data in the filtered web page, parsing the hierarchical relationship of the hierarchical data to obtain its leaf nodes, storing the hierarchical data according to its hierarchical relationship, judging whether the leaf nodes contain links pointing to other data of the target website, and, for each such link in a leaf node, returning to the downloading step and downloading the data the link points to from the remote server;
a web page text processing step: for the web page text in the filtered web page, storing the web page text; and
a reconstruction step: forming a local website from the stored non-web-page data, hierarchical data, and web page text according to the original structure.
7. The website rebuilding method as claimed in claim 6, characterized in that the filtering step filters the web page using keywords.
8. The website rebuilding method as claimed in claim 6, characterized in that the filtering step also comprises performing error correction on the web page.
9. The website rebuilding method as claimed in claim 6, characterized in that, before the judging step, the method also comprises judging whether the filtered web page is an abnormal web page and, if it is, performing no further processing on the filtered web page.
10. The website rebuilding method as claimed in claim 6, characterized in that the downloading step downloads the data of the target website at a preset frequency.
CN2009103118771A 2009-12-21 2009-12-21 Web site rebuilding system and method Pending CN102103622A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN2009103118771A CN102103622A (en) 2009-12-21 2009-12-21 Web site rebuilding system and method

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN2009103118771A CN102103622A (en) 2009-12-21 2009-12-21 Web site rebuilding system and method

Publications (1)

Publication Number Publication Date
CN102103622A true CN102103622A (en) 2011-06-22

Family

ID=44156398

Family Applications (1)

Application Number Title Priority Date Filing Date
CN2009103118771A Pending CN102103622A (en) 2009-12-21 2009-12-21 Web site rebuilding system and method

Country Status (1)

Country Link
CN (1) CN102103622A (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106802893A (en) * 2015-11-26 2017-06-06 财团法人资讯工业策进会 Website method for simplifying and the website simplification device using it
CN110209971A (en) * 2019-05-15 2019-09-06 朱容宇 A kind of method and system of website recombination reduction

Similar Documents

Publication Publication Date Title
US9940391B2 (en) System, method and computer readable medium for web crawling
CN101127038B (en) System and method for downloading website static web page
CN102254027B (en) Method for obtaining webpage contents in batch
CN106484828B (en) Distributed internet data rapid acquisition system and acquisition method
US8131753B2 (en) Apparatus and method for accessing and indexing dynamic web pages
CN107145556B (en) Universal distributed acquisition system
CN102567407B (en) Method and system for collecting forum reply increment
CN102333122A (en) Downloaded resource provision method, device and system
CN103279567A (en) Web data collection method and system both based on AJAX (asynchronous javascript and extensible markup language)
CN103324669A (en) Method and client for processing web page bookmark
CN102521251A (en) Method for directly realizing personalized search, device for realizing method, and search server
CN102663062A (en) Method and device for processing invalid links in search result
CN101630330A (en) Method for webpage classification
CN102750352A (en) Method and device for classified collection of historical access records in browser
CN103297469A (en) Method and device of collecting website data
CN104182412A (en) Webpage crawling method and webpage crawling system
CN103617213A (en) Method and system for identifying newspage attributive characters
CN103123640A (en) Method and device for searching novel
CN102004805B (en) Webpage denoising system and method based on maximum similarity matching
CN103605742B (en) Recognize the method and device of Internet resources entity catalogue page
CN102103622A (en) Web site rebuilding system and method
CN103955517A (en) Method and system for converting data in documental database to relational database
CN105183843A (en) List page recognition system and method
US8706705B1 (en) System and method for associating data relating to features of a data entity
CN104835052A (en) Method and system for improving network advertisement delivery precision

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
C12 Rejection of a patent application after its publication
RJ01 Rejection of invention patent application after publication

Application publication date: 20110622