CN101727471A

CN101727471A - Website content retrieval system and method

Info

Publication number: CN101727471A
Application number: CN200810305300A
Authority: CN
Inventors: 常小军
Original assignee: Hongfujin Precision Industry Shenzhen Co Ltd; Hon Hai Precision Industry Co Ltd
Current assignee: Hongfujin Precision Industry Shenzhen Co Ltd; Hon Hai Precision Industry Co Ltd
Priority date: 2008-10-30
Filing date: 2008-10-30
Publication date: 2010-06-09

Abstract

The invention relates to a website content retrieval method comprising the following steps: (a) receiving keywords and a web address; (b) sending to a Web server a request to access the webpage corresponding to the current web address; (c) receiving the html code produced when the Web server parses the current webpage; (d) screening the text information out of the html code; (e) when the keywords are found in the screened text information, saving the path of the current webpage in a retrieval result list; (f) when jump link addresses exist in the html code, extracting the required jump link addresses, storing them in an array, traversing the array, screening out the jump link addresses of the sub-pages of the current webpage, taking each of them as the current webpage, and returning to step (b); and (g) prompting that retrieval is complete when no jump link address exists in the html code. The invention also provides a website content retrieval system.

Description

Website content retrieval system and method

Technical field

The present invention relates to a retrieval system and method, and in particular to a website content retrieval system and method.

Background art

With the development of computer networks, websites play an increasingly powerful role in publishing and transmitting information, and the content and data they contain keep multiplying. This growth in data volume makes it very difficult for users to find exactly the information they need. At present, most users rely on public search services: they use the major search engines (e.g. Baidu, Google) to search for the data they need. These search engines use powerful servers and file-indexing technology to index the content of target websites and offer a query service to the public.

However, these search engines have certain defects. First, a search engine can only search website content that has already been included in its database. Search-engine lookup is a file content retrieval service offered by a third-party service provider; when a search engine includes a website, it searches the website according to its own dictionary and saves the search results. For example, if website A is included by a certain search engine, users can find most of the content of website A through that search engine; if website B is not included, no related content from website B can be found in it. Building a dedicated search engine, meanwhile, consumes a great deal of manpower and material resources and is difficult to maintain. Second, because website development languages and runtime server environments differ greatly, search engines' support for the various kinds of dynamic websites is imperfect, and the content they include is incomplete. In addition, search engines use file-indexing technology and save pre-searched results during inclusion, whereas the actual content of a website may change, so a search may return outdated content.

Summary of the invention

In view of the above, it is necessary to provide a website content retrieval system that can effectively retrieve all the website content the user needs.

It is also necessary to provide a website content retrieval method that can effectively retrieve all the website content the user needs.

A website content retrieval system runs on a client host that comprises a retrieval result list. The system comprises: a receiving module, used to receive keywords and a web address, send to a Web server a request to access the current webpage corresponding to the web address, and receive the html code produced when the Web server parses the current webpage; a screening module, used to screen the text information out of the html code; a search module, used to search the screened text information for the keywords; a saving module, used to save the path of the current webpage to the retrieval result list when the keywords are found in the screened text information; and an extraction module, used to extract the required jump link addresses after the search is finished, when jump link addresses exist in the html code. The saving module is also used to store the extracted jump link addresses in an array; and the screening module is also used to traverse the array, screen out the jump link addresses of the sub-pages of the current webpage, and send each sub-page jump link address to the receiving module.

A website content retrieval method comprises the following steps: (a) receiving keywords and a web address; (b) sending to a Web server a request to access the current webpage corresponding to the web address; (c) receiving the html code produced when the Web server parses the current webpage; (d) screening the text information out of the html code; (e) when the keywords are found in the screened text information, saving the path of the current webpage to a retrieval result list; (f) when jump link addresses exist in the html code, extracting the required jump link addresses, storing the extracted jump link addresses in an array, traversing the array, screening out the jump link addresses of the sub-pages of the current webpage, taking each of them as the current webpage, and returning to step (b); and (g) when no jump link address exists in the html code, prompting that retrieval is complete.

Compared with the prior art, the website content retrieval system and method search website content level by level using a recursive algorithm and do not rely on retrieving the site's database, so the website content the user needs can be retrieved more comprehensively.

Description of drawings

Fig. 1 is a diagram of the operating environment of a preferred embodiment of the website content retrieval system of the present invention.

Fig. 2 is a functional block diagram of a preferred embodiment of the website content retrieval system 100 of the present invention.

Fig. 3 is a flowchart of a preferred embodiment of the website content retrieval method of the present invention.

Detailed description of the embodiments

Fig. 1 shows the operating environment of a preferred embodiment of the website content retrieval system of the present invention. The website content retrieval system 100 runs on a client host 1. It can run independently under its own domain name, or it can be embedded in a webpage 4 as a component of that webpage, and the webpage can be set as the home page. The website content retrieval system 100 provides an interface that comprises at least two input fields. In one input field the user enters at least one keyword; several keywords can be entered in the same field, separated by spaces. In the other field the user enters a web address. This web address is the top-level webpage of the retrieval, and the website it corresponds to is the target website to be retrieved. The website content retrieval system 100 finds the webpage corresponding to the web address entered by the user and searches that webpage for the entered keywords. The invention uses multithreading and supports multi-site retrieval, that is, the scope of the keyword search can be set to several websites.

The client host 1 comprises a retrieval result list, which is used to store the webpage addresses or webpage content retrieved by the website content retrieval system 100.

The client host 1 is connected to a Web server 2, which is used to access all webpages 4 through a network 3. The network 3 can be the Internet, an intranet, or any other suitable communication medium.

Fig. 2 is a functional block diagram of a preferred embodiment of the website content retrieval system 100 of the present invention. Each module is a software program segment with a specific function. The software is stored in a computer-readable storage medium or other storage device and can be executed by a computer or other computing device that includes a processor, thereby completing the series of procedures of website content retrieval. The website content retrieval system 100 comprises: a receiving module 10, a screening module 12, a search module 14, a saving module 16, an extraction module 18, and a prompting module 20.

The receiving module 10 is used to receive the keywords the user enters in the input field and the web address to be retrieved. After receiving the web address to be retrieved, the receiving module 10 also sends to the Web server 2 a request to access the webpage corresponding to the received web address (this webpage is called the current webpage) and receives the html code that the Web server 2 produces by parsing the current webpage. In this preferred embodiment, once the user has entered the keywords and the web address to be retrieved in the input fields, the Web server 2 immediately parses the content of the webpage corresponding to the web address. If the website itself is developed in a dynamic language (e.g. JSP, ASP, .NET), the Web server 2 parses the dynamic-language program, generates html code, and returns the html code to the client host 1, where the receiving module 10 receives it.

The screening module 12 is used to screen the text information out of the html code received by the receiving module 10. The html code contains the standard html tags needed to build the page, for example the table tag <table> and the layer tag <div>. These standard html tags are not content displayed by the webpage; they only shape the page's interface. The screening module 12 therefore removes all html tags except jump links (<a></a>), and what remains is the text information of the webpage.
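The tag-screening step can be illustrated with a minimal Python sketch built on the standard library's `html.parser`. The names `TextFilter` and `filter_text` are hypothetical and do not come from the patent. The sketch returns only the visible text and leaves link extraction to a separate pass over the original html code; skipping `<script>` and `<style>` bodies is an added assumption.

```python
from html.parser import HTMLParser


class TextFilter(HTMLParser):
    """Collects the text between tags and drops the tags themselves."""

    # Assumption (not in the patent): script/style bodies are not page text.
    SKIPPED = {"script", "style"}

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.parts = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIPPED:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIPPED and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Keep non-empty text that is not inside a skipped element.
        if self._skip_depth == 0 and data.strip():
            self.parts.append(data.strip())


def filter_text(html_code):
    """Return the text information of a page as one space-joined string."""
    parser = TextFilter()
    parser.feed(html_code)
    parser.close()
    return " ".join(parser.parts)
```

Layout tags such as `<table>` and `<div>` disappear from the result, while the text inside `<a>` elements is kept because it is displayed content.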

The search module 14 is used to search the screened text information for the keywords received by the receiving module 10.

The saving module 16 is used to save the path of the current webpage to the retrieval result list when the keywords are found in the screened text information.
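The search and save steps could look roughly like the following Python sketch. The function names are hypothetical. The patent does not say whether a page must contain all of several keywords or only one of them, nor whether matching is case-sensitive; this sketch assumes that any one keyword suffices and that matching is a case-sensitive substring test.

```python
def find_keywords(text, keyword_field):
    """Return the space-separated keywords that occur in the text."""
    keywords = keyword_field.split()  # keywords are separated by spaces
    return [k for k in keywords if k in text]


def record_if_found(text, keyword_field, page_url, result_list):
    """Append page_url to the retrieval result list when a keyword matches.

    Assumption: one matching keyword is enough to record the page.
    """
    if find_keywords(text, keyword_field):
        result_list.append(page_url)
        return True
    return False
```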

The extraction module 18 is used, after the search is finished, to determine whether jump links exist in the html code and, if they do, to extract all jump links from the html code. In this embodiment, a jump link has the form: <a href=http://xxx.com>text</a>.

The saving module 16 is also used to store the extracted jump links in an array.
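As an illustration of the extraction step, the following Python sketch collects the `href` value of every `<a>` tag into a list, which stands in for the array. `LinkExtractor` and `extract_jump_links` are hypothetical names. Resolving relative links against the current page with `urljoin` is an added assumption, since the patent only shows an absolute address.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkExtractor(HTMLParser):
    """Collects the href of each <a> tag, in document order."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        for name, value in attrs:
            if name == "href" and value:
                # Assumption: relative links are resolved against the page URL.
                self.links.append(urljoin(self.base_url, value))


def extract_jump_links(html_code, base_url):
    """Return all jump link addresses found in the html code."""
    extractor = LinkExtractor(base_url)
    extractor.feed(html_code)
    extractor.close()
    return extractor.links
```

An empty list corresponds to the case in which no jump link exists in the html code.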

The screening module 12 is also used to traverse the array, screen out the jump link addresses of all sub-pages of the current webpage, and send each sub-page jump link address to the receiving module 10. The receiving module 10 then sends to the Web server 2 a request to access each selected jump link address. The jump links include not only links to the sub-pages of the current webpage but also other jump link addresses such as friend links or links to other sites on the network. The screening module 12 screens out the jump link addresses that do not belong to this website. In this preferred embodiment, the screening module 12 checks whether the path of each jump link address in the array is a sub-path of the path of the current webpage saved in the retrieval result list; that is, it determines whether the leading part of each jump link address is identical to the domain name of the website corresponding to the web address to be retrieved that was received by the receiving module 10. If it is identical, the jump link address belongs to this website; otherwise, it does not. In addition, to avoid an infinite loop caused by repeatedly retrieving the webpage at the same web address, the domain-name length of each jump link address saved in the array must be greater than the domain-name length of the web address received by the receiving module 10. For example, if the domain-name path of the received web address is http://www.abc.com, the address of each jump-linked webpage saved in the array must contain this path and must be longer than it; a jump link address such as http://www.abc.com/xxx meets the requirement.
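The sub-page test described here amounts to a prefix check plus a length check, which can be sketched in Python as follows. The function names are hypothetical and the comparison is a plain string comparison, as in the patent's example.

```python
def is_sub_page(link, site_url):
    """True when link starts with site_url and is strictly longer than it."""
    return link.startswith(site_url) and len(link) > len(site_url)


def filter_sub_pages(links, site_url):
    """Keep only the jump link addresses that belong to the site's sub-pages."""
    return [link for link in links if is_sub_page(link, site_url)]
```

With `site_url` set to `http://www.abc.com`, the address `http://www.abc.com/xxx` is kept, while `http://www.abc.com` itself and links to other sites are dropped. A plain prefix test would also accept an address such as `http://www.abc.com.example.org`, so a stricter comparison of the host part may be worth considering in practice.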

The prompting module 20 is used to prompt that retrieval is complete when no jump link exists in the html code.

Fig. 3 is a flowchart of a preferred embodiment of the website content retrieval method of the present invention.

Step S30: the receiving module 10 receives the keywords the user enters in the input field and the web address to be retrieved.

Step S32: the receiving module 10 sends to the Web server 2 a request to access the webpage corresponding to the received web address; this webpage is called the current webpage.

Step S34: the receiving module 10 receives the html code that the Web server 2 produces by parsing the current webpage. In this preferred embodiment, once the user has entered the keywords and the web address to be retrieved in the input fields, the Web server 2 immediately parses the content of the webpage corresponding to the web address. If the website itself is developed in a dynamic language (e.g. JSP, ASP, .NET), the Web server 2 parses the dynamic-language program, generates html code, and returns the html code to the client host 1, where the receiving module 10 receives it.

Step S36: the screening module 12 screens the text information out of the html code received by the receiving module 10. The html code contains the standard html tags needed to build the page, for example the table tag <table> and the layer tag <div>. These tags are not content displayed by the webpage; they only shape the page's interface. The screening module 12 therefore removes all html tags except jump links (<a></a>), and what remains is the text information of the webpage.

Step S38: the search module 14 searches the screened text information for the keywords received by the receiving module 10. If the search module 14 finds the keywords in the screened text information, the flow proceeds to step S40. If it does not, the flow proceeds to step S42.

Step S40: the saving module 16 saves the path of the current webpage to the retrieval result list, and the flow proceeds to step S42.

Step S42: the extraction module 18 determines whether jump links exist in the html code. If they do, the flow proceeds to step S44. If they do not, the flow proceeds to step S48.

Step S44: the extraction module 18 extracts all jump links from the html code. In this embodiment, a jump link has the form: <a href=http://xxx.com>text</a>. The saving module 16 stores the extracted jump links in an array.

Step S46: the screening module 12 traverses the array, screens out the jump link addresses of all sub-pages of the current webpage, and sends each sub-page jump link address to the receiving module 10. The flow then returns to step S32, in which the receiving module 10 sends to the Web server 2 a request to access each selected jump link address. The jump links include not only links to the sub-pages of the current webpage but also other jump link addresses such as friend links or links to other sites on the network. The screening module 12 screens out the jump link addresses that do not belong to this website. In this preferred embodiment, the screening module 12 checks whether the path of each jump link address in the array is a sub-path of the path of the current webpage saved in the retrieval result list; that is, it determines whether the leading part of each jump link address is identical to the domain name of the website corresponding to the web address received by the receiving module 10. If it is identical, the jump link address belongs to this website; otherwise, it does not. In addition, to avoid an infinite loop caused by repeatedly retrieving the webpage at the same web address, the domain-name length of each jump link address saved in the array must be greater than the domain-name length of the web address received by the receiving module 10. For example, if the domain-name path of the received web address is http://www.abc.com, the address of each jump-linked webpage saved in the array must contain this path and must be longer than it; a jump link address such as http://www.abc.com/xxx meets the requirement.

Step S48: the prompting module 20 prompts that retrieval is complete.
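Steps S30 to S48 can be tied together in a compact Python sketch. It is an illustration under stated assumptions, not the patented implementation. `retrieve` and its `fetch` parameter (a callable that maps a web address to html code) are hypothetical names. The sketch uses an explicit queue instead of recursion, crude regular expressions instead of a full html parser, expects quoted absolute `href` values, and adds a `visited` set that the patent does not describe, because the length check alone does not stop two sub-pages from linking to each other indefinitely.

```python
import re
from collections import deque


def retrieve(keyword_field, site_url, fetch):
    """Return the retrieval result list for one site.

    fetch is a hypothetical callable: web address -> html code.
    """
    keywords = keyword_field.split()       # S30: space-separated keywords
    results = []                           # the retrieval result list
    visited = set()                        # assumption: guard against revisits
    pending = deque([site_url])
    while pending:
        current = pending.popleft()
        if current in visited:
            continue
        visited.add(current)
        html_code = fetch(current)                         # S32, S34
        text = re.sub(r"<[^>]+>", " ", html_code)          # S36: drop tags
        if any(k in text for k in keywords):               # S38
            results.append(current)                        # S40
        # S42, S44: collect quoted href values of <a> tags (may be empty).
        links = re.findall(r'<a\s[^>]*href="([^"]+)"', html_code)
        for link in links:                                 # S46
            if link.startswith(site_url) and len(link) > len(site_url):
                pending.append(link)
    return results                         # S48: retrieval is complete
```

Passing `fetch` as a parameter keeps the sketch independent of any particular HTTP client; in a real client it might wrap a standard HTTP request.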

Finally, it should be noted that the above embodiments are intended only to illustrate the technical solution of the present invention and not to limit it. Although the present invention has been described in detail with reference to the preferred embodiments, those of ordinary skill in the art should understand that the technical solution of the present invention can be modified or equivalently replaced without departing from the spirit and scope of the technical solution of the present invention.

Claims

1. A website content retrieval system running on a client host, the client host comprising a retrieval result list, characterized in that the system comprises:

a receiving module, used to receive keywords and a web address, send to a Web server a request to access the current webpage corresponding to the web address, and receive the html code produced when the Web server parses the current webpage;

a screening module, used to screen the text information out of the html code;

a search module, used to search the screened text information for the keywords;

a saving module, used to save the path of the current webpage to the retrieval result list when the keywords are found in the screened text information;

an extraction module, used to extract the required jump link addresses after the search is finished, when jump link addresses exist in the html code;

the saving module being also used to store the extracted jump link addresses in an array; and

the screening module being also used to traverse the array, screen out the jump link addresses of the sub-pages of the current webpage, and send each sub-page jump link address to the receiving module.

2. The website content retrieval system as claimed in claim 1, characterized in that the system further comprises a prompting module, used to prompt that retrieval is complete when no jump link exists in the html code.

3. The website content retrieval system as claimed in claim 1, characterized in that the domain-name length of the jump link address of each sub-page of the current webpage is greater than the domain-name length of the current webpage.

4. The website content retrieval system as claimed in claim 1, characterized in that there are one or more keywords, and multiple keywords are separated from one another by spaces.

5. A website content retrieval method, characterized in that the method comprises the following steps:

(a) receiving keywords and a web address;

(b) sending to a Web server a request to access the current webpage corresponding to the web address;

(c) receiving the html code produced when the Web server parses the current webpage;

(d) screening the text information out of the html code;

(e) when the keywords are found in the screened text information, saving the path of the current webpage to a retrieval result list;

(f) when jump link addresses exist in the html code, extracting the required jump link addresses, storing the extracted jump link addresses in an array, traversing the array, screening out the jump link addresses of the sub-pages of the current webpage, taking each of them as the current webpage, and returning to step (b); and

(g) when no jump link address exists in the html code, prompting that retrieval is complete.

6. The website content retrieval method as claimed in claim 5, characterized in that the domain-name length of the jump link address of each sub-page of the current webpage is greater than the domain-name length of the current webpage.

7. The website content retrieval method as claimed in claim 5, characterized in that there are one or more keywords, and multiple keywords are separated from one another by spaces.