CN101727471A - Website content retrieval system and method - Google Patents

Website content retrieval system and method

Info

Publication number
CN101727471A
CN101727471A CN200810305300A CN200810305300A CN101727471A CN 101727471 A CN101727471 A CN 101727471A CN 200810305300 A CN200810305300 A CN 200810305300A CN 200810305300 A CN200810305300 A CN 200810305300A CN 101727471 A CN101727471 A CN 101727471A
Authority
CN
China
Prior art keywords
redirect
web page
chained address
module
current
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN200810305300A
Other languages
Chinese (zh)
Inventor
常小军
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Hongfujin Precision Industry Shenzhen Co Ltd
Hon Hai Precision Industry Co Ltd
Original Assignee
Hongfujin Precision Industry Shenzhen Co Ltd
Hon Hai Precision Industry Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Hongfujin Precision Industry Shenzhen Co Ltd and Hon Hai Precision Industry Co Ltd
Priority to CN200810305300A
Publication of CN101727471A
Legal status: Pending

Abstract

The invention relates to a website content retrieval method comprising the following steps: (a) receiving keywords and a web address; (b) sending a Web server a request to access the web page corresponding to the current web address; (c) receiving the html code that the Web server produces by parsing the current web page; (d) screening out the text information from the html code; (e) when the keywords are found in the screened text information, saving the path of the current web page to a retrieval result list; (f) when jump link addresses exist in the html code, extracting the required jump link addresses, storing them in an array, traversing the array, screening out the jump link addresses of the sub-pages of the current web page, taking each as the current web page, and returning to step (b); and (g) prompting that the retrieval is complete when no jump link address exists in the html code. The invention also provides a website content retrieval system.

Description

Website content retrieval system and method
Technical field
The present invention relates to a retrieval system and method, and in particular to a website content retrieval system and method.
Background
With the development of computer networks, websites play an increasingly powerful role in publishing and disseminating information, and the content and data they contain have multiplied. This growth in data volume makes it very difficult for users to find exactly the information they need. At present, most users rely on general-purpose search services: they use one of the major search engines (for example, Baidu or Google) to search for the data they need. These search engines use powerful servers and file indexing technology to index the content of target websites and offer a query service to the public.
However, these search engines have certain shortcomings. First, a search engine can only search website content that has already been included in its database. What a search engine offers is a file content retrieval service run by a third-party service provider: when it includes a website, the search engine searches the site according to its own dictionary and saves the search results. For example, if website A has been included by a given search engine, users can find most of the content of website A through that engine. If website B has not been included, none of its content can be found through that engine. Setting up a dedicated search service engine consumes considerable manpower and material resources and is difficult to maintain. Second, because websites differ greatly in development language and server runtime environment, search engines do not fully support the various kinds of dynamic websites when including them, so the content included is incomplete. In addition, search engines rely on file indexing: results are saved in advance during inclusion, whereas the actual content of the website may have changed since, so a search may return outdated content.
Summary of the invention
In view of the above, it is necessary to provide a website content retrieval system that can effectively retrieve all of the website content a user needs.
It is also necessary to provide a website content retrieval method that can effectively retrieve all of the website content a user needs.
A website content retrieval system runs on a client host that includes a retrieval result list. The system comprises: a receiving module, configured to receive a keyword and a web address, send a Web server a request to access the current web page corresponding to the web address, and receive the html code that the Web server produces by parsing the current web page; a screening module, configured to screen out the text information from the html code; a search module, configured to search the screened text information for the keyword; a saving module, configured to save the path of the current web page to the retrieval result list when the keyword is found in the screened text information; and an extraction module, configured to extract the required jump link addresses after the search is finished, when jump link addresses exist in the html code. The saving module is further configured to store the extracted jump link addresses in an array. The screening module is further configured to traverse the array, screen out the jump link addresses of the sub-pages of the current web page, and send each sub-page jump link address to the receiving module.
A website content retrieval method comprises the following steps: (a) receiving a keyword and a web address; (b) sending a Web server a request to access the current web page corresponding to the web address; (c) receiving the html code that the Web server produces by parsing the current web page; (d) screening out the text information from the html code; (e) when the keyword is found in the screened text information, saving the path of the current web page to a retrieval result list; (f) when jump link addresses exist in the html code, extracting the required jump link addresses, storing them in an array, traversing the array, screening out the jump link addresses of the sub-pages of the current web page, taking each as the current web page, and returning to step (b); and (g) when no jump link address exists in the html code, prompting that the retrieval is complete.
Compared with the prior art, the website content retrieval system and method search website content level by level using a recursive algorithm and do not rely on retrieval from a website database, so they can find the website content the user wants to retrieve more comprehensively.
Brief description of the drawings
Fig. 1 is a diagram of the operating environment of a preferred embodiment of the website content retrieval system of the present invention.
Fig. 2 is a functional block diagram of a preferred embodiment of the website content retrieval system 100 of the present invention.
Fig. 3 is a flowchart of a preferred embodiment of the website content retrieval method of the present invention.
Detailed description
Fig. 1 shows the operating environment of a preferred embodiment of the website content retrieval system of the present invention. The website content retrieval system 100 runs on a client host 1. It can run independently under its own domain name, or it can be embedded in a web page 4 as a component of that page, and the page can be set as a home page. The website content retrieval system 100 provides an interface with at least two input fields. In one field the user enters at least one keyword; several keywords can be entered in the same field, separated by spaces. In the other field the user enters a web address. This web address is the top-level web page of the retrieval, and the website it belongs to is the target website to be searched. The website content retrieval system 100 finds the web page corresponding to the web address entered by the user and searches that page for the keywords entered. The invention uses multithreading and supports retrieval across multiple websites; that is, the scope of the keyword search can be set to several websites.
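The input handling described above can be illustrated with a minimal Python sketch. It assumes that keywords are split on whitespace and that one worker thread is used per target website; the names `parse_keywords`, `search_sites`, and the `search_one_site` callable are hypothetical and do not come from the patent.

```python
from concurrent.futures import ThreadPoolExecutor


def parse_keywords(field_text):
    """Split the keyword input field on whitespace, ignoring repeated spaces."""
    return field_text.split()


def search_sites(keyword_field, start_urls, search_one_site):
    """Search several websites concurrently.

    search_one_site(start_url, keywords) is a hypothetical callable that
    returns the retrieval result list for a single website.
    """
    keywords = parse_keywords(keyword_field)
    # Assumption: one thread per website; at least one worker is required.
    with ThreadPoolExecutor(max_workers=max(1, len(start_urls))) as pool:
        results = list(pool.map(lambda url: search_one_site(url, keywords), start_urls))
    return dict(zip(start_urls, results))


if __name__ == "__main__":
    # Stand-in search function used only to show the call shape.
    fake = lambda url, keywords: [url + "/" + k for k in keywords]
    print(search_sites("retrieval  system", ["http://www.abc.com"], fake))
```

Passing the per-site search as a callable keeps the threading concern separate from the page-by-page retrieval logic.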
The client host 1 includes a retrieval result list, which stores the web page addresses or web page content retrieved by the website content retrieval system 100.
The client host 1 is connected to a Web server 2. The Web server 2 is used to access all web pages 4 through a network 3, which may be the Internet, an intranet, or any other suitable communication medium.
Fig. 2 is a functional block diagram of a preferred embodiment of the website content retrieval system 100 of the present invention. The modules described here are software program segments with specific functions. The software is stored on a computer-readable storage medium or other storage device and can be executed by a computer or other computing device that includes a processor, thereby carrying out the sequence of steps for retrieving website content. The website content retrieval system 100 comprises a receiving module 10, a screening module 12, a search module 14, a saving module 16, an extraction module 18, and a prompting module 20.
The receiving module 10 receives the keywords entered by the user in the input field and the web address to be retrieved. After receiving the web address to be retrieved, the receiving module 10 also sends the Web server 2 a request to access the web page corresponding to that web address, referred to as the current web page, and receives the html code that the Web server 2 produces by parsing the current web page. In this preferred embodiment, once the user has entered the keywords and the web address to be retrieved in the input fields, the Web server 2 immediately parses the content of the web page corresponding to the web address. If the website itself is developed in a dynamic language (for example, JSP, ASP, .NET, and so on), the Web server 2 parses the dynamic-language program to generate html code and returns that html code to the client host 1, where the receiving module 10 receives it.
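As a rough illustration of the request-and-receive step, the following Python sketch fetches the html code for a web address with the standard library. The function name `fetch_html`, the `opener` parameter (added so the function can be exercised without network access), and the UTF-8 fallback are assumptions made for this example only.

```python
import urllib.request


def fetch_html(url, opener=urllib.request.urlopen, timeout=10.0):
    """Request the page at `url` and return the html code sent back by the server.

    Whatever dynamic language the site uses on the server side, the client
    only ever sees the generated html, which is what this function returns.
    """
    with opener(url, timeout=timeout) as response:
        # Assumption: fall back to UTF-8 when the server declares no charset.
        charset = response.headers.get_content_charset() or "utf-8"
        return response.read().decode(charset, errors="replace")


if __name__ == "__main__":
    # Placeholder address taken from the example in the description.
    print(fetch_html("http://www.abc.com")[:200])
```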
The screening module 12 screens out the text information from the html code received by the receiving module 10. The html code contains the standard html tags needed to build the page, for example the table tag <table> and the layer tag <div>. These standard html tags are not content displayed by the web page; they only shape the page's interface. The screening module 12 therefore filters out all html tags other than jump links (<a></a>), and what remains is the text information of the web page.
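The screening of text information from html code might look like the following Python sketch, built on the standard `html.parser` module. It is a simplification: it keeps only visible text (including the text inside links) and leaves the link addresses to a separate extraction pass. Skipping `<script>` and `<style>` content is an added assumption, and the names `TextScreen` and `screen_text` are hypothetical.

```python
from html.parser import HTMLParser


class TextScreen(HTMLParser):
    """Collects the text between tags and discards the tags themselves."""

    SKIPPED = {"script", "style"}  # assumption: their content is not page text

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self._chunks = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIPPED:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIPPED and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        if self._skip_depth == 0 and data.strip():
            self._chunks.append(data.strip())

    def text(self):
        return " ".join(self._chunks)


def screen_text(html_code):
    """Return the text information contained in `html_code`."""
    parser = TextScreen()
    parser.feed(html_code)
    parser.close()
    return parser.text()


if __name__ == "__main__":
    print(screen_text("<table><tr><td>Hello</td></tr></table><div>world</div>"))
```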
The search module 14 searches the screened text information to determine whether it contains the keywords received by the receiving module 10.
The saving module 16 saves the path of the current web page to the retrieval result list when the keywords are found in the screened text information.
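The search-and-save behaviour can be sketched in a few lines of Python. The source does not say how several keywords are combined or whether matching is case-sensitive, so this sketch assumes that a page matches when any keyword occurs, ignoring case; `contains_keyword` and `record_if_match` are hypothetical names.

```python
def contains_keyword(text, keywords):
    """Return True if any keyword occurs in the text (assumed case-insensitive)."""
    lowered = text.lower()
    return any(keyword.lower() in lowered for keyword in keywords)


def record_if_match(page_url, text, keywords, result_list):
    """Append page_url to result_list when the text contains a keyword.

    Returns whether the page matched. A page is never stored twice.
    """
    matched = contains_keyword(text, keywords)
    if matched and page_url not in result_list:
        result_list.append(page_url)
    return matched


if __name__ == "__main__":
    results = []
    record_if_match("http://www.abc.com/xxx", "Website content retrieval", ["retrieval"], results)
    print(results)
```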
After the search is finished, the extraction module 18 determines whether any jump link exists in the html code and, if so, extracts all jump links in the html code. In this embodiment, a jump link has the form: <a href=http://xxx.com>text</a>.
The saving module 16 also stores the extracted jump links in an array.
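Extracting the jump links and storing them in an array could be done as in the Python sketch below, which collects the `href` attribute of every `<a>` tag. Resolving relative addresses against the current page with `urljoin` is an assumption beyond what the text states, and `LinkExtractor` and `extract_links` are hypothetical names.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkExtractor(HTMLParser):
    """Stores the target address of every <a href=...> tag in a list."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        for name, value in attrs:
            if name == "href" and value:
                # Assumption: relative links are resolved against the current page.
                self.links.append(urljoin(self.base_url, value))


def extract_links(html_code, base_url):
    """Return all jump link addresses found in `html_code`, in document order."""
    parser = LinkExtractor(base_url)
    parser.feed(html_code)
    parser.close()
    return parser.links


if __name__ == "__main__":
    page = '<a href="http://xxx.com">text</a><a href="/news">news</a>'
    print(extract_links(page, "http://www.abc.com"))
```

An empty list corresponds to the case in which no jump link exists in the html code.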
The screening module 12 also traverses the array, screens out all jump link addresses that point to sub-pages of the current web page, and sends each of these sub-page jump link addresses to the receiving module 10. The receiving module 10 then sends the Web server 2 a request to access each selected jump link address. The jump links include not only links to sub-pages of the current web page but also other jump link addresses such as friendly links or other network links. The screening module 12 filters out the jump link addresses that do not belong to this website. In this preferred embodiment, the screening module 12 checks whether the path of each jump link address in the array is a sub-path of the path of the current web page saved in the retrieval result list; that is, it determines whether the leading part of the domain name of each jump link address is identical to the domain name of the website corresponding to the web address received by the receiving module 10. If it is identical, the jump link address belongs to this website; otherwise it does not. In addition, to avoid an endless loop caused by retrieving the web page at the same web address repeatedly, the domain name length of each jump link address kept in the array must be greater than the domain name length of the web address received by the receiving module 10. For example, if the domain name path of the received web address is http://www.abc.com, the address of each jump-linked web page kept in the array must contain this path and must be longer than it; a jump link address such as http://www.abc.com/xxx meets the requirement.
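The sub-page rule described above (the address must begin with the received web address and must be longer than it) translates almost directly into Python. The sketch below applies that literal rule; dropping duplicate addresses is an added assumption, and `filter_sub_pages` is a hypothetical name.

```python
def filter_sub_pages(links, site_root):
    """Keep only links that start with site_root and are longer than it.

    site_root is the web address entered by the user, for example
    "http://www.abc.com". Duplicates are dropped (an assumption).
    """
    selected = []
    for link in links:
        belongs_to_site = link.startswith(site_root)
        is_longer = len(link) > len(site_root)
        if belongs_to_site and is_longer and link not in selected:
            selected.append(link)
    return selected


if __name__ == "__main__":
    links = [
        "http://www.abc.com/xxx",      # sub-page: kept
        "http://www.abc.com",          # same length as the root: dropped
        "http://friend.example.org/",  # friendly link to another site: dropped
    ]
    print(filter_sub_pages(links, "http://www.abc.com"))
```

A plain prefix comparison would also accept an address such as http://www.abc.com.example.org, so a stricter implementation might compare parsed host names instead; that refinement goes beyond what the description states.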
The prompting module 20 prompts that the retrieval is complete when no jump link exists in the html code.
Fig. 3 is a flowchart of a preferred embodiment of the website content retrieval method of the present invention.
Step S30: the receiving module 10 receives the keywords entered by the user in the input field and the web address to be retrieved.
Step S32: the receiving module 10 sends the Web server 2 a request to access the web page corresponding to the received web address; this web page is referred to as the current web page.
Step S34: the receiving module 10 receives the html code that the Web server 2 produces by parsing the current web page. In this preferred embodiment, once the user has entered the keywords and the web address to be retrieved in the input fields, the Web server 2 immediately parses the content of the web page corresponding to the web address. If the website itself is developed in a dynamic language (for example, JSP, ASP, .NET, and so on), the Web server 2 parses the dynamic-language program to generate html code and returns that html code to the client host 1, where the receiving module 10 receives it.
Step S36: the screening module 12 screens out the text information from the html code received by the receiving module 10. The html code contains the standard html tags needed to build the page, for example the table tag <table> and the layer tag <div>. These tags are not content displayed by the web page; they only shape the page's interface. The screening module 12 therefore filters out all html tags other than jump links (<a></a>), and what remains is the text information of the web page.
Step S38: the search module 14 searches the screened text information to determine whether it contains the keywords received by the receiving module 10. If the search module 14 finds the keywords in the screened text information, the flow proceeds to step S40. If it does not, the flow proceeds to step S42.
Step S40: the saving module 16 saves the path of the current web page to the retrieval result list, and the flow proceeds to step S42.
Step S42: the extraction module 18 determines whether any jump link exists in the html code. If so, the flow proceeds to step S44. If not, the flow proceeds to step S48.
Step S44: the extraction module 18 extracts all jump links in the html code. In this embodiment, a jump link has the form: <a href=http://xxx.com>text</a>. The saving module 16 stores the extracted jump links in an array.
Step S46: the screening module 12 traverses the array, screens out all jump link addresses that point to sub-pages of the current web page, and sends each of these sub-page jump link addresses to the receiving module 10. The flow then returns to step S32, in which the receiving module 10 sends the Web server 2 a request to access each selected jump link address. The jump links include not only links to sub-pages of the current web page but also other jump link addresses such as friendly links or other network links. The screening module 12 filters out the jump link addresses that do not belong to this website. In this preferred embodiment, the screening module 12 checks whether the path of each jump link address in the array is a sub-path of the path of the current web page saved in the retrieval result list; that is, it determines whether the leading part of the domain name of each jump link address is identical to the domain name of the website corresponding to the web address received by the receiving module 10. If it is identical, the jump link address belongs to this website; otherwise it does not. In addition, to avoid an endless loop caused by retrieving the web page at the same web address repeatedly, the domain name length of each jump link address kept in the array must be greater than the domain name length of the web address received by the receiving module 10. For example, if the domain name path of the received web address is http://www.abc.com, the address of each jump-linked web page kept in the array must contain this path and must be longer than it; a jump link address such as http://www.abc.com/xxx meets the requirement.
Step S48: the prompting module 20 prompts that the retrieval is complete.
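Steps S30 to S48 can be tied together in a compact Python sketch. It is only an illustration under stated assumptions: the `fetch` callable stands in for the request to the Web server, a `visited` set is added as an extra guard against revisiting pages, an explicit queue replaces recursion, and a page matches when any keyword occurs. The names `PageParser` and `retrieve` are hypothetical.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class PageParser(HTMLParser):
    """Collects the visible text and the <a href> targets of one page."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.text_parts = []
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.links.append(urljoin(self.base_url, href))

    def handle_data(self, data):
        if data.strip():
            self.text_parts.append(data.strip())


def retrieve(start_url, keywords, fetch):
    """Return the retrieval result list for the site rooted at start_url."""
    results = []           # retrieval result list
    visited = set()        # assumption: extra guard against endless loops
    pending = [start_url]  # pages still to be requested
    while pending:
        url = pending.pop(0)
        if url in visited:
            continue
        visited.add(url)
        html_code = fetch(url)                    # S32, S34
        parser = PageParser(url)
        parser.feed(html_code)
        parser.close()
        text = " ".join(parser.text_parts)        # S36
        if any(k in text for k in keywords):      # S38
            results.append(url)                   # S40
        for link in parser.links:                 # S42, S44
            # S46: keep only longer addresses under the received web address
            if link.startswith(start_url) and len(link) > len(start_url):
                pending.append(link)
    return results                                # S48: retrieval complete


if __name__ == "__main__":
    site = {
        "http://www.abc.com": '<div>Welcome <a href="http://www.abc.com/a">A</a></div>',
        "http://www.abc.com/a": "<table><tr><td>retrieval system</td></tr></table>",
    }
    print(retrieve("http://www.abc.com", ["retrieval"], lambda u: site.get(u, "")))
```

Injecting `fetch` lets the flow be exercised against an in-memory site, as in the usage example, before it is pointed at a real server.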
Finally, it should be noted that the above embodiments are intended only to illustrate the technical solution of the present invention and not to limit it. Although the present invention has been described in detail with reference to the preferred embodiments, those of ordinary skill in the art should understand that the technical solution of the present invention may be modified or replaced with equivalents without departing from the spirit and scope of the technical solution of the present invention.

Claims (7)

1. A website content retrieval system running on a client host, the client host including a retrieval result list, characterized in that the system comprises:
a receiving module, configured to receive a keyword and a web address, send a Web server a request to access the current web page corresponding to the web address, and receive the html code that the Web server produces by parsing the current web page;
a screening module, configured to screen out the text information from the html code;
a search module, configured to search the screened text information for the keyword;
a saving module, configured to save the path of the current web page to the retrieval result list when the keyword is found in the screened text information;
an extraction module, configured to extract the required jump link addresses after the search is finished, when jump link addresses exist in the html code;
the saving module being further configured to store the extracted jump link addresses in an array; and
the screening module being further configured to traverse the array, screen out the jump link addresses of the sub-pages of the current web page, and send each sub-page jump link address to the receiving module.
2. The website content retrieval system as claimed in claim 1, characterized in that the system further comprises a prompting module, configured to prompt that the retrieval is complete when no jump link exists in the html code.
3. The website content retrieval system as claimed in claim 1, characterized in that the domain name length of the jump link address of a sub-page of the current web page is greater than the domain name length of the current web page.
4. The website content retrieval system as claimed in claim 1, characterized in that there are one or more keywords, and multiple keywords are separated from one another by spaces.
5. A website content retrieval method, characterized in that the method comprises the following steps:
(a) receiving a keyword and a web address;
(b) sending a Web server a request to access the current web page corresponding to the web address;
(c) receiving the html code that the Web server produces by parsing the current web page;
(d) screening out the text information from the html code;
(e) when the keyword is found in the screened text information, saving the path of the current web page to a retrieval result list;
(f) when jump link addresses exist in the html code, extracting the required jump link addresses, storing the extracted jump link addresses in an array, traversing the array, screening out the jump link addresses of the sub-pages of the current web page, taking each as the current web page, and returning to step (b); and
(g) when no jump link address exists in the html code, prompting that the retrieval is complete.
6. The website content retrieval method as claimed in claim 5, characterized in that the domain name length of the jump link address of a sub-page of the current web page is greater than the domain name length of the current web page.
7. The website content retrieval method as claimed in claim 5, characterized in that there are one or more keywords, and multiple keywords are separated from one another by spaces.
CN200810305300A 2008-10-30 2008-10-30 Website content retrieval system and method Pending CN101727471A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN200810305300A CN101727471A (en) 2008-10-30 2008-10-30 Website content retrieval system and method

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN200810305300A CN101727471A (en) 2008-10-30 2008-10-30 Website content retrieval system and method

Publications (1)

Publication Number Publication Date
CN101727471A true CN101727471A (en) 2010-06-09

Family

ID=42448367

Family Applications (1)

Application Number Title Priority Date Filing Date
CN200810305300A Pending CN101727471A (en) 2008-10-30 2008-10-30 Website content retrieval system and method

Country Status (1)

Country Link
CN (1) CN101727471A (en)

Cited By (10)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102402546A (en) * 2010-09-08 2012-04-04 腾讯科技(深圳)有限公司 Webpage content display method and system
CN103810177A (en) * 2012-11-07 2014-05-21 江苏仕德伟网络科技股份有限公司 Method for accurately obtaining real dwell time of website visitor on webpages
CN105808545A (en) * 2014-12-30 2016-07-27 Tcl集团股份有限公司 Forum data extraction method and forum data extraction apparatus
CN106021489A (en) * 2016-05-19 2016-10-12 乐视控股(北京)有限公司 Keyword linking method, apparatus and system
WO2017197889A1 (en) * 2016-05-19 2017-11-23 乐视控股(北京)有限公司 Keyword link method, device and system
CN106547821A (en) * 2016-09-29 2017-03-29 广东工业大学 A kind of method in browser according to keyword search related web page
CN109086414A (en) * 2018-08-03 2018-12-25 上海点融信息科技有限责任公司 For searching for the method, apparatus and storage medium of block chain data
CN109635224A (en) * 2018-12-11 2019-04-16 北京知道创宇信息技术有限公司 Record reference information, the method, apparatus in inquiry reference path
CN113676374A (en) * 2021-08-13 2021-11-19 杭州安恒信息技术股份有限公司 Target website clue detection method, device, computer equipment and medium
CN113676374B (en) * 2021-08-13 2024-03-22 杭州安恒信息技术股份有限公司 Target website clue detection method, device, computer equipment and medium

Similar Documents

Publication Publication Date Title
US8417695B2 (en) Identifying related concepts of URLs and domain names
US9686374B2 (en) System and method for fragment level dynamic content regeneration
CN101727471A (en) Website content retrieval system and method
Srikant et al. Mining web logs to improve website organization
CN102722563B (en) Method and device for displaying page
US10346483B2 (en) System and method for search engine optimization
CN106484828B (en) Distributed internet data rapid acquisition system and acquisition method
CN101599089B (en) Method and system for automatically searching and extracting update information on content of video service website
CN100565518C (en) A kind of method and system that keep page current data information
US6385629B1 (en) System and method for the automatic mining of acronym-expansion pairs patterns and formation rules
US6981037B1 (en) Method and system for using access patterns to improve web site hierarchy and organization
CN103389983A (en) Webpage content grabbing method and device applied to network crawler system
CN102521251A (en) Method for directly realizing personalized search, device for realizing method, and search server
WO2010094927A1 (en) Content access platform and methods and apparatus providing access to internet content for heterogeneous devices
US20090083266A1 (en) Techniques for tokenizing urls
CN101916285A (en) Method and device for analyzing internet web page contents
US20150100563A1 (en) Method for retaining search engine optimization in a transferred website
CN101211340A (en) Dynamic network crawler based on client end /service end
CN1960371B (en) Method and system for accessing file of Web application program
CN102622402B (en) Server, method and system for providing information search service by using sheaf of pages
CN104065736A (en) URL redirection method, device, and system
US9529922B1 (en) Computer implemented systems and methods for dynamic and heuristically-generated search returns of particular relevance
CN103617225B (en) A kind of associating web pages searching method and system
CN110955855B (en) Information interception method, device and terminal
CN103618742A (en) Method and system for acquiring sub domain names and webmaster permission verification method

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C02 Deemed withdrawal of patent application after publication (patent law 2001)
WD01 Invention patent application deemed withdrawn after publication

Open date: 20100609