An Internet topic file search method, crawler system, and search engine
Technical field
The present invention relates to Internet document search, and in particular to an Internet topic file search method and a corresponding crawler system and search engine.
Background technology
The Internet has become one of the most popular technologies in the computer field. Its wide adoption lets people overcome the limits of space and region and share information resources easily. The World Wide Web (WWW) is the main and most widely used information service on the Internet. Since its birth it has developed rapidly into a huge information repository that stores a large amount of valuable information, in which people can find all kinds of content that interests them. In practice, however, the sheer volume of data on the Web makes it very difficult for users to find information. Various information retrieval services have arisen in response, and full-text search is one important and widely adopted retrieval technique. Web-based full-text search is now used ever more widely, and many influential large-scale full-text search tools exist; well-known Chinese search engine systems include www.soso.com and www.baidu.com. These full-text retrieval systems play a major role in querying document information on the Web.
An Internet search engine today generally consists of a crawler system, an indexing system, and a retrieval system. The crawler system collects web pages and various files, such as HTML pages and mp3 files, from different websites on the network and hands them to the indexing system, which builds an index database. The retrieval system receives a user's retrieval request, searches the index database, and returns the results that match the request.
The architecture of a typical Internet search engine system, shown in Fig. 1, comprises:
Web page server: provides the web access service of the Chinese search engine system and is the user interface through which users use the system;
Retrieval system: searches the index database using the keywords submitted by the user, sorts and filters the matching documents according to a given algorithm, and returns them to the web page server;
Indexing system: processes the documents collected by the crawler system and builds the index database;
Crawler system: collects Internet web pages and various document data.
Prior art one: collect all websites and web pages.
In a search engine that searches for Internet files of a particular topic, the crawler system generally collects only files of that topic, then builds an index and provides retrieval. To collect files of a particular topic, however, it must first collect web pages in order to find the Uniform Resource Locator (URL) links to those files.
At present, crawler systems generally traverse all web pages: they collect every page and file and then keep only the files of the required topic. Because very few pages contain topic files, downloading them is very inefficient: tens of thousands of pages may have to be downloaded to find a single topic file link, and that link may well be dead. A technique is therefore needed that raises the probability of downloading pages that contain topic files.
Prior art two: collect only websites and web pages of a specific topic.
Analysis of collected web pages shows that links between pages generally exhibit two properties: topic clustering and locality. Locality means that pages on the same host are more likely to link to one another, and topic clustering means that pages on the same topic are more likely to link to one another.
These link properties are illustrated in Fig. 2, where each circle represents a web page and a solid circle represents a page that contains an mp3 file. Suppose mp3 files are to be collected. Fig. 2 shows the links among news-topic pages, music-topic pages, and the mp3 files they contain: news-topic pages link heavily to one another, music-topic pages link heavily to one another, and links between music-topic and news-topic pages are comparatively few. A music-topic page is more likely than a news-topic page to contain the URL of an mp3 file.
Prior art two therefore searches only pages of a specific topic. Taking mp3 collection as an example, the crawler system of an MP3 search engine collects music-topic websites and pages, so it finds and collects mp3 files fairly efficiently.
Although prior art two collects more efficiently, it covers only a small number of specific websites, so the total number of topic files collected is small and it cannot collect as many of the files on the Internet as possible.
Summary of the invention
The invention provides an Internet topic file search method to solve the problem in the prior art that searching for Internet topic files is either inefficient or incomplete.
To solve this technical problem, the invention adopts the following technical solution: an Internet topic file search method, comprising:
A. parsing a downloaded web page and extracting the Uniform Resource Locators (URLs) it contains;
B. calculating the page topic score of each collected web page that contains the URL, and accumulating these page topic scores as the URL topic score of the URL;
determining the priority of the URL according to the value of its URL topic score;
C. collecting the URLs in order of priority from high to low, building an index, and searching for the required Internet topic files.
The above method of the invention further comprises:
storing a history record of the URLs that have already been collected;
in step B, determining from the history record whether each URL contained in the downloaded web page has already been collected, and determining a priority only for URLs that have not been collected.
The above method of the invention further comprises:
setting a URL filter condition, and determining a priority only for uncollected URLs that do not meet the filter condition.
The page topic score is calculated by the formula:
F(p) = a×numFileLink×FactorLink + b×numKeyWord×FactorWord;
where F(p) is the calculated page topic score;
numFileLink is the number of topic file URLs contained in the web page;
FactorLink is the scoring factor for URL links;
numKeyWord is the number of topic keywords contained in the web page;
FactorWord is the scoring factor for topic keywords;
a and b are weight factors, with a+b=1.
The invention also provides a crawler system for a search engine, comprising a URL queue storage module, a webpage and file download module, a webpage parsing module, and an acquisition control module;
the URL queue storage module stores the URLs to be collected in order of priority;
the webpage and file download module downloads web pages or files one by one in order of URL priority from high to low, sends downloaded web pages to the webpage parsing module, and sends downloaded files to the indexing system of the search engine for processing;
the webpage parsing module parses a web page, extracts the URLs it contains, and sends them to the acquisition control module;
the acquisition control module calculates the page topic score of each collected web page that contains the URL, and accumulates these page topic scores as the URL topic score of the URL;
it determines the priority of the URL according to the value of the URL topic score and stores the URL in the corresponding priority queue in the URL queue storage module.
The above crawler system further comprises a URL filtering module connected between the webpage parsing module and the acquisition control module;
the URL filtering module determines whether each URL parsed out by the webpage parsing module has already been collected and keeps only the URLs that have not; it further determines whether each uncollected URL meets the configured URL filter condition, and sends only the uncollected URLs that do not meet the filter condition to the acquisition control module.
Corresponding to the crawler system, the invention also provides a search engine comprising a crawler system, an indexing system, and a retrieval system, the crawler system comprising a URL queue storage module, a webpage and file download module, a webpage parsing module, and an acquisition control module;
the URL queue storage module stores the URLs to be collected in order of priority;
the webpage and file download module downloads web pages or files one by one in order of URL priority from high to low, sends downloaded web pages to the webpage parsing module, and sends downloaded files to the indexing system of the search engine for processing;
the webpage parsing module parses a web page, extracts the URLs it contains, and sends them to the acquisition control module;
the acquisition control module calculates the page topic score of each collected web page that contains the URL, and accumulates these page topic scores as the URL topic score of the URL; it determines the priority of the URL according to the value of the URL topic score and stores the URL in the corresponding priority queue in the URL queue storage module.
The beneficial effects of the invention are as follows:
(1) The invention parses downloaded web pages, extracts the URLs they contain, assigns each URL a priority according to a predefined rule, and collects higher-priority URLs first when searching for the required topic files. Because higher-priority URLs are more closely related to topic files and more likely to lead to them, the invention can improve search efficiency.
(2) The invention is not limited to searching certain specific websites; it can search every relevant web page in order of URL priority, and can therefore search the entire Internet.
Description of drawings
Fig. 1 is an architecture diagram of a prior-art Chinese information retrieval system;
Fig. 2 is a schematic diagram of web page links between different topics;
Fig. 3 is a structural diagram of the crawler system provided by the invention;
Fig. 4 is a flowchart of the method of the invention.
Embodiment
Fig. 3 is a structural diagram of the crawler system 1 provided by the invention. It comprises a webpage and file download module 11, a webpage parsing module 12, a URL filtering module 13, an acquisition control module 14, and a URL queue storage module 15.
The function of each module is described in detail below.
Webpage and file download module 11: downloads web pages or files using HTTP or FTP (File Transfer Protocol), submits downloaded web pages to the webpage parsing module 12, and submits downloaded files to the indexing system of the search engine, which builds the index database;
When the crawler system 1 first starts, a number of seed URLs, for example common navigation directory pages such as www.hao123.com, are placed in the highest-priority URL queue of the URL queue storage module 15, with their URL topic scores set to a default initial value. The webpage and file download module 11 takes the seed URLs from the URL queue, downloads the web pages, and sends them to the webpage parsing module 12 for parsing.
Webpage parsing module 12: parses HTML web pages, extracts the URL links they contain, and submits them to the URL filtering module 13.
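The link extraction performed by the webpage parsing module could be sketched as follows. This is only an illustration using Python's standard `html.parser`; the class name `LinkExtractor`, the function `extract_links`, and the choice to read only `<a href>` attributes and resolve them against the page's own URL are assumptions, not details given in this description.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkExtractor(HTMLParser):
    """Collects the href targets of <a> tags as absolute URLs (hypothetical helper)."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        for name, value in attrs:
            if name == "href" and value:
                # Resolve relative links against the URL of the page being parsed.
                self.links.append(urljoin(self.base_url, value))


def extract_links(html, base_url):
    """Return every link found in the HTML text, in document order."""
    parser = LinkExtractor(base_url)
    parser.feed(html)
    return parser.links
```

A real parser would probably also look at other link-bearing tags and normalize the URLs before passing them on; this sketch leaves that out.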
URL filtering module 13: determines whether each URL has already been collected and, if it has not, whether it meets the filter condition. A URL that has not been collected and does not meet the filter condition is sent to the acquisition control module 14 as a URL to be collected;
The URL filtering module 13 stores a history record of collected URLs. It uses this record to determine whether each URL contained in a downloaded web page has already been collected, and adds newly collected URLs to the record in real time;
The URL filtering module 13 can also store a filter condition, for example a configured URL blacklist. The module checks whether the current URL is on the blacklist; if it is, the URL meets the filter condition, is filtered out, and is not sent to the acquisition control module 14. Otherwise, the URL filtering module 13 sends every URL received from the webpage parsing module 12 that has not been collected and does not meet the filter condition to the acquisition control module 14 for processing.
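The two checks made by the URL filtering module, the collected-URL history and the blacklist, can be illustrated with a minimal sketch. The class name `UrlFilter` and its methods are hypothetical, and matching the blacklist by host name is an assumption; the description says only that the blacklist is a set of URLs.

```python
from urllib.parse import urlsplit


class UrlFilter:
    """Holds the collected-URL history and a blacklist (hypothetical structure)."""

    def __init__(self, blacklist_hosts=()):
        self.collected = set()                  # history record of collected URLs
        self.blacklist = set(blacklist_hosts)   # assumption: blacklist keyed by host

    def mark_collected(self, url):
        """Record a URL as collected so that later occurrences are dropped."""
        self.collected.add(url)

    def accept(self, url):
        """Return True if the URL should go on to the acquisition control module."""
        if url in self.collected:
            return False                        # already collected
        host = urlsplit(url).hostname or ""
        return host not in self.blacklist       # on the blacklist: filtered out
```

A set gives constant-time membership tests; a crawler at Internet scale would likely need a more compact or persistent structure for the history, which this description does not specify.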
Acquisition control module 14: uses a predefined algorithm to calculate the URL topic score of each URL to be collected, determines the priority of each URL according to the value of its URL topic score, and stores each URL in the queue of the URL queue storage module 15 that corresponds to its priority;
The URL topic score is calculated as follows:
S(url) = \sum_{p \in P(url)} F(p)   formula (1)
In formula (1), S(url) is the URL topic score of the URL, F(p) is the topic score of web page p, and P(url) denotes the set of collected web pages that contain the URL. That is, the topic score of a URL is the sum of the topic scores of all collected web pages that contain it.
Where:
F(p) = a×numFileLink×FactorLink + b×numKeyWord×FactorWord   formula (2)
In formula (2), F(p) is the page topic score calculated for a web page that contains the URL;
numFileLink is the number of topic file URLs contained in the web page;
FactorLink is the scoring factor for URL links;
numKeyWord is the number of topic keywords contained in the web page;
FactorWord is the scoring factor for topic keywords;
a and b are weight factors, with a+b=1;
In other words, the topic score of a web page depends on the number of topic files and the number of topic keywords it contains: the more topic files and topic keywords a page contains, the higher its topic score.
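As an illustration of formulas (1) and (2), the two scores could be computed as in the sketch below. The function and class names, the default weights a = b = 0.5, and the decision to count a page only once per URL even if it links to that URL several times are assumptions made for the example; the description fixes only the formulas and the constraint a+b=1.

```python
from collections import defaultdict


def page_topic_score(num_file_link, num_key_word, factor_link, factor_word,
                     a=0.5, b=0.5):
    """Formula (2): F(p) = a*numFileLink*FactorLink + b*numKeyWord*FactorWord."""
    if abs(a + b - 1.0) > 1e-9:
        raise ValueError("weight factors must satisfy a + b = 1")
    return a * num_file_link * factor_link + b * num_key_word * factor_word


class UrlScores:
    """Formula (1): S(url) accumulates F(p) over collected pages p containing url."""

    def __init__(self):
        self.scores = defaultdict(float)

    def add_page(self, page_score, out_urls):
        # Assumption: a page contributes once per distinct URL it contains.
        for url in set(out_urls):
            self.scores[url] += page_score


# Usage with made-up numbers: a page with 4 topic file links and 10 topic keywords.
f_p = page_topic_score(num_file_link=4, num_key_word=10,
                       factor_link=2.0, factor_word=0.5)
scores = UrlScores()
scores.add_page(f_p, ["http://music.example/a.mp3", "http://music.example/2"])
```

Because S(url) is a running sum, a URL that is linked from several high-scoring pages ends up with a higher score than one linked from a single page.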
URL queue storage module 15: stores several URL queues of different priorities and places each URL to be collected in a priority queue according to the value of its URL topic score. For example, three queues are kept: a first, a second, and a third priority queue. URLs are divided into three intervals by topic score. The first priority queue has the highest rank and stores the URLs to be collected whose topic scores fall in the highest interval; the second priority queue ranks next, and the third priority queue ranks lowest. The webpage and file download module 11 first collects the URLs in the first priority queue. Only when the first priority queue is empty (a collected URL is deleted from its queue, so the queue becomes empty once all its URLs have been collected) does it go on to collect, in order, the URLs in the second and third priority queues;
The number of URL queues stored in the URL queue storage module 15 can be set freely; the invention does not limit it.
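A minimal sketch of such a multi-level queue store is given below. The class name `UrlQueueStore` and the threshold values that split the scores into intervals are hypothetical; the description leaves both the number of queues and the interval boundaries open.

```python
from collections import deque


class UrlQueueStore:
    """Several FIFO queues, index 0 being the highest priority (hypothetical)."""

    def __init__(self, thresholds=(10.0, 5.0)):
        # Assumption: thresholds are in descending order; N thresholds give N+1 queues.
        self.thresholds = tuple(thresholds)
        self.queues = [deque() for _ in range(len(self.thresholds) + 1)]

    def level_for(self, score):
        """Map a URL topic score to the index of its priority queue."""
        for level, threshold in enumerate(self.thresholds):
            if score >= threshold:
                return level
        return len(self.thresholds)

    def put(self, url, score):
        self.queues[self.level_for(score)].append(url)

    def get(self):
        """Pop from the highest-priority non-empty queue, or return None."""
        for queue in self.queues:
            if queue:
                return queue.popleft()   # a collected URL leaves its queue
        return None
```

With the default thresholds this gives the three queues of the example above: scores of 10 or more go to the first queue, scores from 5 up to 10 to the second, and the rest to the third.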
Based on the crawler system 1 described above, the invention provides a topic file search method whose flow, shown in Fig. 4, comprises:
Step S11: the webpage parsing module parses a web page downloaded by the webpage and file download module, extracts the URLs the page contains, and sends them to the URL filtering module;
Step S12: the URL filtering module determines whether the current URL has already been collected or meets the filter condition and must be filtered out. If the URL has been collected or meets the filter condition, it is discarded and the flow returns to step S11, where the webpage parsing module continues to extract the other URLs contained in the page. If the URL has not been collected and does not meet the filter condition, it is sent to the acquisition control module and the flow continues with the following steps;
Step S13: the acquisition control module calculates the URL topic score of the URL using the topic file collection algorithm (for example, the algorithm defined by formulas (1) and (2) above);
Step S14: the acquisition control module determines the priority of the URL from the configured mapping between URL topic scores and priorities, and stores the URL in the corresponding priority queue of the URL queue storage module;
Step S15: the webpage and file download module reads URLs in turn, starting from the highest-priority queue, and downloads them. Downloaded web pages are sent to the webpage parsing module for processing, and downloaded files are sent to the indexing system of the search engine.
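The flow of steps S11 to S15 can be traced end to end with a toy example. The sketch below makes several simplifications that are not part of this description: the "web" is a hypothetical in-memory dictionary that already gives each page's score F(p) and its links, no filter condition is applied beyond the collected-URL history, and the next URL is chosen by taking the pending URL with the highest accumulated score rather than by reading from discrete priority queues.

```python
# Hypothetical in-memory "web": URL -> (page score F(p), links found on the page).
FAKE_WEB = {
    "http://dir.example/": (1.0, ["http://news.example/", "http://music.example/"]),
    "http://news.example/": (0.5, ["http://news.example/2", "http://music.example/"]),
    "http://news.example/2": (0.2, []),
    "http://music.example/": (4.0, ["http://music.example/a.mp3",
                                    "http://music.example/2"]),
    "http://music.example/2": (3.0, ["http://music.example/b.mp3"]),
}


def crawl(web, seeds, is_topic_file, seed_score=1.0, max_downloads=100):
    """Return (download order, topic files handed to the indexing system)."""
    collected = set()                              # history of collected URLs
    pending = {url: seed_score for url in seeds}   # URL -> accumulated S(url)
    order, indexed_files = [], []
    while pending and len(order) < max_downloads:
        # S15: download the pending URL with the highest score
        # (simplification: stands in for the priority queues).
        url = max(pending, key=pending.get)
        del pending[url]
        collected.add(url)
        order.append(url)
        if is_topic_file(url):
            indexed_files.append(url)              # file goes to the indexing system
            continue
        if url not in web:
            continue                               # dead link: nothing to parse
        page_score, links = web[url]               # S11: parse page, extract URLs
        for link in sorted(set(links)):
            if link in collected:                  # S12: drop collected URLs
                continue
            # S13/S14: accumulate F(p) into S(url); the score sets the priority.
            pending[link] = pending.get(link, 0.0) + page_score
    return order, indexed_files


order, files = crawl(FAKE_WEB, ["http://dir.example/"],
                     is_topic_file=lambda u: u.endswith(".mp3"))
```

In this toy graph the music pages carry the higher scores, so both mp3 files are reached before the low-scoring news page `http://news.example/2`, which is the behaviour the prioritised collection is meant to produce.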
In summary, the invention parses downloaded web pages and extracts the URLs they contain; calculates a topic score for each URL using the URL topic score calculation method; determines a priority according to a predefined rule; places the URL in the corresponding priority queue; and collects higher-priority URLs first when searching for the required topic files. Because higher-priority URLs are more closely related to topic files and more likely to lead to them, the invention can improve search efficiency.
In addition, the invention is not limited to certain specific websites and can search the entire Internet, so the search is comprehensive and meets users' needs.
Obviously, those skilled in the art can make various changes and modifications to the invention without departing from its spirit and scope. If such changes and modifications fall within the scope of the claims of the invention and their technical equivalents, the invention is intended to include them as well.