CN102355488B - Crawler seed obtaining method and equipment and crawler crawling method and equipment - Google Patents

Info

Publication number
CN102355488B
Authority
CN
China
Prior art keywords
url
page
crawler
seed
queue
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Expired - Fee Related
Application number
CN201110232595.XA
Other languages
Chinese (zh)
Other versions
CN102355488A (en
Inventor
吴滨华
王祖海
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Star Net Ruijie Networks Co Ltd
Original Assignee
Beijing Star Net Ruijie Networks Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Star Net Ruijie Networks Co Ltd filed Critical Beijing Star Net Ruijie Networks Co Ltd
Priority to CN201110232595.XA priority Critical patent/CN102355488B/en
Publication of CN102355488A publication Critical patent/CN102355488A/en
Application granted granted Critical
Publication of CN102355488B publication Critical patent/CN102355488B/en
Expired - Fee Related legal-status Critical Current
Anticipated expiration legal-status Critical

Abstract

The invention provides a crawler seed obtaining method and equipment and a crawler crawling method and equipment. The crawler seed obtaining method comprises the following steps of: establishing a dynamic page request according to the preset search term dictionary and the URL (uniform resource locator) characteristics of a target navigation website; sending the dynamic page request to a server of the target navigation website; according to the preset extraction policy, extracting the target URL from the search result page returned by the server according to the dynamic page request, wherein the target URL is the main domain name address of the URL in the search result page; and performing unique processing on the target URL to obtain the unique target URL, wherein the unique target URL is used as a crawler seed. Through the technical scheme of the invention, abundant crawler seeds with great dispersion can be provided, and thus the time for forming the mainstream URL is shortened, the coverage of the mainstream URL is improved, and the time cost for crawling of the crawler system is reduced.

Description

Crawler seed obtaining method and equipment and crawler crawling method and equipment
Technical field
The present invention relates to search engine technology, and in particular to a crawler seed obtaining method and equipment and a crawler crawling method and equipment.
Background art
A search engine is a system that uses specific computer programs to gather information from the Internet according to a certain strategy, organizes and processes that information, provides a retrieval service for users, and presents to each user the search results relevant to that user's query.
At present, web page crawling strategies fall into three categories: depth-first, breadth-first, and best-first. Depth-first crawling causes the crawler to become trapped in many cases, so breadth-first and best-first are the strategies commonly used today. Breadth-first search means that, during crawling, the search of the next level begins only after the search of the current level is complete. The algorithm is relatively simple to design and implement, and breadth-first search can cover as many web pages as possible. Much research also applies breadth-first search to focused crawlers (a focused crawler is a web crawler program oriented toward a particular topic); the basic idea is that web pages within a certain link distance of the initial Uniform Resource Locator (URL) are very likely to be topically relevant. Another approach combines breadth-first search with web page filtering: web pages are first fetched with a breadth-first strategy, and the irrelevant pages are then filtered out. A best-first search strategy uses a web page analysis algorithm to predict the similarity between each candidate URL and the target web page, or its relevance to the topic, and selects the one or several best-rated URLs to fetch. Because a best-first strategy visits only the web pages that the analysis algorithm predicts to be "useful", it is a locally optimal search algorithm.
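The level-by-level behavior of breadth-first crawling can be illustrated with a minimal sketch. It runs on a toy in-memory link graph rather than real pages; the graph, the function name `breadth_first_order`, and the `max_depth` parameter are assumptions made for illustration and do not come from the patent.

```python
from collections import deque

def breadth_first_order(link_graph, start_url, max_depth):
    """Return URLs in the order a breadth-first crawl would visit them.

    link_graph maps each URL to the URLs its page links to (a stand-in
    for fetching and parsing real pages).
    """
    visited = {start_url}
    order = []
    frontier = deque([(start_url, 0)])  # (url, link distance from start)
    while frontier:
        url, depth = frontier.popleft()
        order.append(url)
        if depth == max_depth:
            continue  # do not expand pages at the depth limit
        for linked in link_graph.get(url, []):
            if linked not in visited:
                visited.add(linked)
                frontier.append((linked, depth + 1))
    return order

# Hypothetical link graph used only for illustration.
TOY_GRAPH = {
    "a": ["b", "c"],
    "b": ["d"],
    "c": ["d", "e"],
    "d": ["a"],
}

if __name__ == "__main__":
    print(breadth_first_order(TOY_GRAPH, "a", max_depth=2))
```

All pages at link distance 1 from the start are visited before any page at distance 2, which is the property the paragraph above describes.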
A web crawler is the information-crawling component of a search engine. A crawler seed is the URL, or set of URLs, from which a web crawler starts crawling. A URL is an identifier that fully describes the address of a web page or other resource on the Internet. Every web page on the Internet has a unique name identifier, usually called its URL address; this address may point to a local disk or to a computer on a local area network. Put simply, a URL is a Web address, commonly known as a "website address". In the prior art, crawler seeds are assigned to the web crawler manually in advance. The crawler seeds are the information source for the URLs that the web crawler subsequently crawls. Specifically, the web crawler extracts the other URL addresses contained in a crawler seed page and puts them into a to-be-crawled URL queue as the targets of subsequent crawling; as the number of crawled URLs grows, the set of crawler seeds keeps changing and expanding.
In the prior art, crawler seeds are usually a handful of URLs specified manually in advance, and there is no systematic strategy or scheme for screening and building crawler seeds. As a result, when searching the whole network, it takes a long time (typically six months to a year) to obtain a large number of mainstream URLs, and because the number of crawler seeds is limited, the coverage of the resulting mainstream URLs is also poor. For a crawler system that needs to search mainstream data across the whole network quickly, the time cost is enormous and the system is difficult to deploy and implement.
Summary of the invention
Embodiments of the present invention provide a crawler seed obtaining method and equipment and a crawler crawling method and equipment, which provide a large number of highly dispersed crawler seeds, thereby shortening the time needed to form mainstream URLs, improving the coverage of the mainstream URLs, and reducing the time cost of crawling for the crawler system.
The invention provides a crawler seed obtaining method, comprising:
Constructing a dynamic page request according to a preset search term dictionary and the Uniform Resource Locator (URL) characteristics of a target navigation website;
Sending the dynamic page request to a server of the target navigation website;
Extracting, according to a preset extraction policy, target URLs from the search result page returned by the server in response to the dynamic page request, wherein a target URL is the main domain name address of a URL in the search result page;
Performing uniqueness processing on the target URLs to obtain unique target URLs, and using the unique target URLs as crawler seeds.
The invention provides a crawler crawling method that uses the crawler seeds obtained by the crawler seed obtaining method provided by the invention, comprising:
When crawler seeds exist in a seed queue in memory space, obtaining one crawler seed from the seed queue and crawling it, adding the URLs in the page crawled according to that crawler seed to a to-be-crawled queue in the memory space, and deleting the crawled crawler seed from the seed queue;
When no crawler seed exists in the seed queue and to-be-crawled URLs exist in the to-be-crawled queue, obtaining one to-be-crawled URL from the to-be-crawled queue and crawling it, adding the URLs in the page crawled according to the obtained to-be-crawled URL to the to-be-crawled queue, and deleting the crawled URL from the to-be-crawled queue.
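The two-queue crawling order described above can be sketched as follows. This is a minimal illustration, not the patented implementation: `fetch_links` is a hypothetical callback standing in for fetching a page and extracting its URLs, and the `seen` set and `max_pages` limit are assumptions added so that the example terminates on cyclic link structures.

```python
from collections import deque

def crawl(seeds, fetch_links, max_pages=1000):
    """Crawl seeds first, then URLs discovered along the way.

    fetch_links(url) returns the URLs found in the page at url.
    Returns the URLs in the order they were crawled.
    """
    seed_queue = deque(seeds)   # crawler seeds, consumed first
    to_crawl = deque()          # URLs discovered while crawling
    seen = set(seeds)           # assumption: skip URLs already queued
    crawled = []
    while (seed_queue or to_crawl) and len(crawled) < max_pages:
        # Take from the seed queue while it is non-empty; popleft()
        # both obtains the entry and deletes it from its queue.
        url = seed_queue.popleft() if seed_queue else to_crawl.popleft()
        crawled.append(url)
        for link in fetch_links(url):
            if link not in seen:
                seen.add(link)
                to_crawl.append(link)
    return crawled

# Hypothetical link structure used only for illustration.
TOY_LINKS = {
    "s1": ["p1", "p2"],
    "s2": ["p2", "p3"],
    "p1": ["s1"],
}

if __name__ == "__main__":
    print(crawl(["s1", "s2"], lambda url: TOY_LINKS.get(url, [])))
```

Every seed is crawled before any discovered URL, so a large, dispersed seed set spreads the crawl across many sites from the start.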
The invention provides crawler seed obtaining equipment, comprising:
A constructing module, configured to construct a dynamic page request according to a preset search term dictionary and the Uniform Resource Locator (URL) characteristics of a target navigation website;
A sending module, configured to send the dynamic page request to a server of the target navigation website;
An extraction module, configured to extract, according to a preset extraction policy, target URLs from the search result page returned by the server in response to the dynamic page request, wherein a target URL is the main domain name address of a URL in the search result page;
An obtaining module, configured to perform uniqueness processing on the target URLs to obtain unique target URLs, and to use the unique target URLs as crawler seeds.
The invention provides crawler crawling equipment that uses the crawler seeds obtained by the crawler seed obtaining method provided by the invention, comprising:
A first crawling module, configured to, when crawler seeds exist in a seed queue in memory space, obtain one crawler seed from the seed queue and crawl it, add the URLs in the page crawled according to that crawler seed to a to-be-crawled queue in the memory space, and delete the crawled crawler seed from the seed queue;
A second crawling module, configured to, when no crawler seed exists in the seed queue and to-be-crawled URLs exist in the to-be-crawled queue, obtain one to-be-crawled URL from the to-be-crawled queue and crawl it, add the URLs in the page crawled according to the obtained to-be-crawled URL to the to-be-crawled queue, and delete the crawled URL from the to-be-crawled queue.
In the crawler seed obtaining method and equipment of the present invention, the crawler seed obtaining equipment constructs a dynamic page request according to the search term dictionary and the URL characteristics of the target navigation website and sends it to the server, actively extracts from the search result page returned by the server the main domain name addresses of the URLs that meet the URL characteristics as target URLs, and performs uniqueness processing on the target URLs to obtain crawler seeds. Compared with the prior-art method of manually specifying crawler seeds, the technical solution of the present invention can obtain a large number of highly dispersed URLs as crawler seeds and makes the obtaining of crawler seeds systematic. This reduces the time cost of obtaining mainstream URLs from the crawler seeds, improves the coverage of the mainstream URLs obtained, improves the efficiency of crawling based on the crawler seeds, and reduces the time cost of crawling.
Brief description of the drawings
To describe the technical solutions in the embodiments of the present invention or in the prior art more clearly, the accompanying drawings required for describing the embodiments or the prior art are briefly introduced below. Apparently, the accompanying drawings in the following description show some embodiments of the present invention, and a person of ordinary skill in the art may derive other drawings from them without creative effort.
Fig. 1 is a schematic diagram of the topology of a prior-art search engine;
Fig. 2 is a flowchart of the crawler seed obtaining method provided by one embodiment of the present invention;
Fig. 3 is a flowchart of the crawler seed obtaining method provided by another embodiment of the present invention;
Fig. 4 is a flowchart of the crawler crawling method provided by one embodiment of the present invention;
Fig. 5A is a flowchart of the crawler crawling method provided by another embodiment of the present invention;
Fig. 5B is a schematic diagram of the crawler crawling topology provided by another embodiment of the present invention;
Fig. 6 is a schematic structural diagram of the crawler seed obtaining equipment provided by one embodiment of the present invention;
Fig. 7 is a schematic structural diagram of the crawler seed obtaining equipment provided by another embodiment of the present invention;
Fig. 8 is a schematic structural diagram of the crawler crawling equipment provided by one embodiment of the present invention;
Fig. 9 is a schematic structural diagram of the crawler crawling equipment provided by another embodiment of the present invention.
Detailed description of the embodiments
To make the objectives, technical solutions, and advantages of the embodiments of the present invention clearer, the technical solutions in the embodiments are described clearly and completely below with reference to the accompanying drawings. Obviously, the described embodiments are some rather than all of the embodiments of the present invention. All other embodiments obtained by a person of ordinary skill in the art based on the embodiments of the present invention without creative effort fall within the protection scope of the present invention.
Fig. 1 is a schematic diagram of the topology of a prior-art search engine. Fig. 1 shows three servers, but the number is not limited to three. The first, second, and third servers may be, respectively, the server of the Sina (sina) website, the server of the 163 website, and the server of the Yahoo (yahoo) website. Each server mainly stores web pages and, in response to a crawl request from the crawler server, returns a crawl result page to the crawler server. The crawler server is responsible for crawling web pages on the Internet according to certain rules and for saving the crawled content (that is, the crawl result pages returned by each server) locally. The crawler server may save the crawled content as text or in a database. The information extractor performs information extraction, stripping, and similar processing on the web page content crawled by the crawler server. The search engine text retrieval system performs information retrieval, according to business requirements, over the massive web information extracted and separated by the information extractor, and presents the search results to the user.
In the above search engine topology, the crawler server starts crawling with only a limited number of manually predefined URLs as crawler seeds, and expands the set of URLs to crawl by continuously extracting URLs from the crawled web pages. In this existing scheme, the crawler starts out along only a few paths, which easily leads to endless loops in the crawl path. Moreover, expanding the URLs to crawl mainly by extracting URLs during crawling is slow to arrive at the mainstream URLs and has a high time cost, which in turn makes the search engine inefficient and costly in time. To address these problems, the following embodiments of the present invention provide a technical solution that, before the crawler server starts crawling, obtains a large number of mainstream URLs that match the target topic as the crawler seeds for the initial crawl, so as to increase crawling speed and save crawling time.
Fig. 2 is a flowchart of the crawler seed obtaining method provided by one embodiment of the present invention. As shown in Fig. 2, the method of this embodiment comprises:
Step 101: Construct a dynamic page request according to a preset search term dictionary and the URL characteristics of a target navigation website.
The key problem of this embodiment is how to obtain a large number of discrete URLs as crawler seeds before the crawler server starts crawling. This embodiment makes full use of the large-scale search engine databases of well-known websites such as Baidu and Google (google): it automatically constructs URLs that conform to each well-known website, sends the constructed URLs to the server of each website so that the server performs a full-site deep search, and obtains a large number of URLs from the search result pages returned by each server.
At present, on most well-known websites (for example Baidu, google, and Alexa), the core data is obtained by constructing a dynamic URL and sending a request to the website's server. In other words, the data output of most well-known websites is conditioned on data input, and the input condition is generally a dynamic URL constructed in a certain format. The dynamic URL format usually differs from website to website. Taking the search term "stock" as an example, the position of "stock" in the URL, the encoding format used, the display mode, and so on all differ between websites. Besides the URL format, websites also differ in how paging through the received search results is controlled and in how the number of records shown on each web page is controlled. The position of the search term in the URL, the encoding format used, the paging control, the control of the number of records displayed, and so on are collectively called the characteristics of the website; the URL-related information, such as the position of the search term in the URL and the encoding format used, is called the URL characteristics of the website. The URL characteristics of different websites are generally not the same. Typically, the dynamic URL string format of the google website is: http://www.google.com.hk/search?q=%E8%82%A1%E7%A5%A8&hl=zh-CN&newwindow=1&safe=strict&prmd=ivnsub&ei=upHTTZiZGIycvgPB7dS4DQ&start=0&sa=N; and the dynamic URL string format of the Baidu website is: http://www.baidu.com/s?wd=%s&pn=%d&usm=3. Here, %s in a dynamic URL string stands for the keyword code, after encoding conversion, that is to be filled in; pn=%d denotes the index position of the first URL on each page. It is related to the cookie settings (data that sites such as Baidu and google store on the local terminal to identify the user and track the session), which determine how many search records each page contains.
For example, for the search term "stock" (股票), the URL used when searching on the Baidu website is: http://www.baidu.com/s?wd=%B9%C9%C6%B1&pn=0&usm=3; and the URL used when searching on the google website is: http://www.google.com.hk/search?q=%E8%82%A1%E7%A5%A8&hl=zh-CN&newwindow=1&safe=strict&prmd=ivnsub&ei=upHTTZiZGIycvgPB7dS4DQ&start=0&sa=N. On the Baidu website the keyword code for "stock" is %B9%C9%C6%B1, whereas on the google website it is %E8%82%A1%E7%A5%A8. In addition, on the Baidu website, pn=0 in the URL for "stock" indicates that the results start from ID=0; on the google website, a value such as start=10 in the URL for "stock" indicates that the results start from ID=10.
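The way a dynamic URL template is filled in can be sketched as follows. The sketch uses the Baidu-style template quoted above and takes an already-encoded keyword as input; the function name `page_urls` and the page size of 10 records per page are assumptions made for illustration, since the text notes that the actual page size depends on site and cookie settings.

```python
# Template adapted from the text; %s is the encoded keyword and
# %d is the index of the first record on the page.
BAIDU_TEMPLATE = "http://www.baidu.com/s?wd=%s&pn=%d&usm=3"

def page_urls(template, encoded_keyword, page_size, pages):
    """Build the dynamic URLs for the first `pages` result pages.

    page_size is an assumed number of records per page.
    """
    return [
        template % (encoded_keyword, page * page_size)
        for page in range(pages)
    ]

if __name__ == "__main__":
    # %B9%C9%C6%B1 is the keyword code for "stock" given in the text.
    for url in page_urls(BAIDU_TEMPLATE, "%B9%C9%C6%B1", 10, 3):
        print(url)
```

Keeping the template separate from the keyword code means that supporting another website only requires a different template string and start-index convention.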
To accommodate the URL characteristics of different websites, the crawler seed obtaining equipment of this embodiment constructs dynamic URLs according to the URL characteristics of each website.
In addition, to obtain as many discrete URLs as possible, the crawler seed obtaining equipment of this embodiment is preconfigured with a search term dictionary containing a large number of search terms and with the required target navigation websites. The search terms in the dictionary may be arbitrary manually entered data that the target navigation website can recognize, data from a valid data set provided by the target navigation website, or data from a scientific and technical dictionary or a dictionary used by a search engine. The more search terms the dictionary contains, and the more target navigation websites are specified, the better. The target navigation websites may include Baidu, google, Alexa, yahoo, sina, and so on. On this basis, the crawler seed obtaining equipment of this embodiment constructs, according to the search terms in the preset dictionary and the URL characteristics of the target navigation website, a dynamic page request that conforms to the URL characteristics of the target navigation website; that is, it constructs a dynamic URL that conforms to those URL characteristics.
Step 102: Send the dynamic page request to the server of the target navigation website.
After constructing a dynamic page request that conforms to the URL characteristics, the crawler seed obtaining equipment sends it to the server of the target navigation website so that the server performs a full-site deep search according to the request.
In a specific implementation, the crawler seed obtaining equipment may use the HyperText Transfer Protocol (HTTP) to send an HTTP request containing the dynamic page request to the server. Alternatively, the equipment may use a web browser control to realistically simulate an IE browser accessing the URL and send a URL access request containing the dynamic page request to the server.
Step 103: Extract, according to a preset extraction policy, target URLs from the search result page returned by the server in response to the dynamic page request.
After receiving the dynamic page request, the server performs a full-site deep search according to the request and returns a search result page to the crawler seed obtaining equipment. The search result page is generally a web page carrying the search results; depending on the number of results, it may consist of one web page or of several.
After receiving the search result page returned by the server, the crawler seed obtaining equipment extracts target URLs from it according to the preset extraction policy. A target URL is the main domain name address of a URL that meets the URL characteristics of the target navigation website. Because the URL characteristics of different websites differ, and because different search terms yield different numbers of results and therefore different numbers of search result pages, the crawler seed obtaining equipment of this embodiment is preconfigured with a policy for extracting target URLs that takes these conditions into account. The extraction policy mainly includes determining whether the search result page has a next page, determining whether a URL meets the URL characteristics of the target navigation website, and the handling to perform for each outcome.
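One part of such an extraction policy, collecting the absolute URLs that appear in a result page, can be sketched with the Python standard library. This is a sketch under stated assumptions, not the patent's policy: the class name `LinkCollector`, the helper `extract_links`, and the rule of keeping only `http://` and `https://` links are hypothetical. Detecting a "next page" link is site-specific and is deliberately left out.

```python
from html.parser import HTMLParser

class LinkCollector(HTMLParser):
    """Collect absolute http(s) links from anchor tags."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        href = dict(attrs).get("href")
        # Assumed rule: relative links (for example paging links on
        # the search site itself) are not candidate target URLs.
        if href and href.startswith(("http://", "https://")):
            self.links.append(href)

def extract_links(html_text):
    """Return the absolute links found in html_text, in page order."""
    collector = LinkCollector()
    collector.feed(html_text)
    return collector.links

if __name__ == "__main__":
    sample = (
        '<a href="http://www.sina.com.cn/news">news</a>'
        '<a href="/s?pn=10">next</a>'
    )
    print(extract_links(sample))
```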
A URL usually consists of a main domain name address (domain) and a leaf. For example, suppose the URL for the search term "stock" on the Baidu website is http://www.baidu.com/s?wd=%B9%C9%C6%B1&pn=0&usm=3; the main domain name address in this URL is www.baidu.com, and the rest is the leaf. As another example, suppose a URL for the search term "stock" found in the search result page is http://news.baidu.com/ns?cl=2&rn=20&tn=news&word=%B9%C9%C6%B1; here "news.baidu.com" is the main domain name address, and the rest is the leaf.
In practice, for a single search term, the search result page usually contains several URLs with the same main domain name address; for example, www.sina.com.cn/sports and www.sina.com.cn/news share the same main domain name address. For URLs that share a main domain name address, the web pages they correspond to can all be crawled starting from that one main domain name address. Crawling each of these URLs separately would produce the same crawl results while increasing the number of crawls, lowering crawling efficiency, and wasting crawling time. Therefore, in this embodiment, the crawler seed obtaining equipment extracts, according to the extraction policy, the main domain name address of each URL in the search result page as the target URL.
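Splitting a URL into its main domain name address and leaf can be sketched with `urllib.parse`. The function name `main_domain_address` is hypothetical, and treating the host part of the URL as the "main domain name address" is an interpretation based on the examples in the text.

```python
from urllib.parse import urlsplit

def main_domain_address(url):
    """Return the host part of url, e.g. 'news.baidu.com'.

    URLs written without a scheme (such as 'www.sina.com.cn/news')
    are accepted by prefixing '//' so that urlsplit sees a host.
    """
    if "://" not in url:
        url = "//" + url
    return urlsplit(url).hostname or ""

if __name__ == "__main__":
    for u in (
        "http://news.baidu.com/ns?cl=2&rn=20&tn=news&word=%B9%C9%C6%B1",
        "www.sina.com.cn/sports",
        "www.sina.com.cn/news",
    ):
        print(main_domain_address(u))
```

Both sina URLs map to the same value, which is why extracting the main domain name address collapses many result URLs into one candidate seed.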
Step 104: Perform uniqueness processing on the target URLs to obtain unique target URLs, and use the unique target URLs as crawler seeds.
After the crawler seed obtaining equipment extracts the target URLs from the search result page, the dispersion of the final crawler seeds must be ensured: while keeping the coverage of the crawler seeds as wide as possible, each crawler seed should be as discrete as possible, so that large numbers of redundant target URLs are not loaded into memory space and crawled. This improves the efficiency of crawling based on the crawler seeds.
To this end, the crawler seed obtaining equipment of this embodiment performs uniqueness processing on the target URLs to obtain unique target URLs. Specifically, the equipment compares the target URLs with one another and, for target URLs that are repeated, keeps only one; each target URL that is kept is regarded as a unique target URL. The equipment uses the resulting unique target URLs as crawler seeds.
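The uniqueness processing amounts to deduplication. A minimal sketch, assuming the target URLs are plain strings and that first-seen order should be preserved (the function name `unique_targets` is hypothetical):

```python
def unique_targets(target_urls):
    """Keep one copy of each repeated target URL, in first-seen order."""
    # dict keys are unique and preserve insertion order.
    return list(dict.fromkeys(target_urls))

if __name__ == "__main__":
    targets = ["www.sina.com.cn", "www.baidu.com", "www.sina.com.cn"]
    print(unique_targets(targets))
```

For very large seed sets the same idea applies, although a persistent store or a hash-based filter may be preferable to an in-memory dictionary.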
In the crawler seed obtaining method of this embodiment, the crawler seed obtaining equipment constructs a dynamic page request according to the search term dictionary and the URL characteristics of the target navigation website and sends it to the server, actively extracts from the search result page returned by the server the main domain name addresses of the URLs that meet the URL characteristics as target URLs, and performs uniqueness processing on the target URLs to obtain crawler seeds. Compared with the prior-art method of manually specifying crawler seeds, this embodiment can obtain a large number of highly dispersed crawler seeds and makes the obtaining of crawler seeds systematic. When mainstream URLs are then obtained by crawling from these crawler seeds, the efficiency of obtaining mainstream URLs improves, the time cost falls, and the coverage of the mainstream URLs obtained improves. This also lays the foundation for faster and more efficient crawling based on the crawler seeds and for a lower crawling time cost.
The search term dictionary in the embodiments of the present invention usually contains a large number of search terms, and each search term corresponds to one dynamic page request. That is, the crawler seed obtaining equipment can construct one dynamic page request for each search term according to the URL characteristics of the target navigation website. One implementation of step 101 provided by this embodiment comprises:
Step 1011: The crawler seed obtaining equipment first loads all the search terms in the search term dictionary into its memory space.
The search term dictionary is normally kept in external storage. To improve execution speed and efficiency, the crawler seed obtaining equipment loads all the search terms in the dictionary into its memory space in advance.
Step 1012: The crawler seed obtaining equipment determines whether any search term remains in its memory space; if yes, it performs step 1013; if no, it performs step 1016.
In this embodiment, each time the crawler seed obtaining equipment generates a dynamic page request from a search term and obtains the search result page, it deletes that search term from the memory space. The equipment therefore needs to determine whether any search term remains in the memory space in order to know whether the search has been performed for all search terms.
The crawler seed obtaining equipment may actively check the memory space for remaining search terms at a certain frequency, or it may check only after receiving the search result page for the previous search term, that is, process the next search term only after the previous one has been handled. Because checking after the previous search term's result page has been received avoids obtaining the same search term more than once, it is the preferred implementation.
Step 1013: The crawler seed obtaining equipment obtains one search term, and performs step 1014.
When search terms remain in the memory space, the crawler seed obtaining equipment obtains one of them.
Step 1014: The crawler seed obtaining equipment constructs, according to the URL characteristics of the target navigation website, a dynamic URL using the hexadecimal code corresponding to the obtained search term, so as to form a dynamic page request, and then goes to step 1015.
Websites usually represent the search term in a URL in hexadecimal form. Therefore, after obtaining a search term, the crawler seed obtaining equipment constructs the dynamic URL from the hexadecimal code corresponding to the search term, according to the URL characteristics of the target navigation website, to form the dynamic page request. The equipment first needs to identify the target navigation website to be searched this time, for example Baidu or google, and then determine the URL characteristics of that website. Here, the URL characteristics mainly refer to the basic format of the URL string.
Most websites, such as Baidu and yahoo, use the GB2312 encoding format, whereas the URLs of some special websites, such as google, use UTF-8. To meet the needs of most websites, the search terms in the search term dictionary of this embodiment are encoded in GB2312. Websites that use UTF-8, such as google, are stored by the crawler seed obtaining equipment in advance as special websites. Accordingly, in the above process the equipment needs to determine whether the target navigation website is one of the preset special websites. If it is, the equipment converts the GB2312 code of the obtained search term to UTF-8, converts the binary value of the UTF-8 code to a hexadecimal value, and uses the resulting hexadecimal code to construct a dynamic URL that conforms to the URL characteristics of the special website, thereby forming the dynamic page request. If it is not, the equipment converts the binary value of the GB2312 code of the obtained search term to a hexadecimal value and uses the resulting hexadecimal code to construct a dynamic URL that conforms to the URL characteristics of the target navigation website, thereby forming the dynamic page request.
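The encoding decision described above can be sketched as follows. The search term is assumed to be stored as GB2312 bytes, as the text describes; the set `SPECIAL_SITES` and the function name `hex_code_for_site` are hypothetical names introduced for illustration.

```python
# Hypothetical registry of "special" websites whose URLs use UTF-8.
SPECIAL_SITES = {"google"}

def hex_code_for_site(term_gb2312, site):
    """Return the %XX keyword code for a GB2312-encoded search term.

    For special sites the bytes are first re-encoded as UTF-8;
    otherwise the GB2312 bytes are used directly.
    """
    if site in SPECIAL_SITES:
        raw = term_gb2312.decode("gb2312").encode("utf-8")
    else:
        raw = term_gb2312
    # Write every byte as a percent sign followed by two hex digits.
    return "".join("%%%02X" % byte for byte in raw)

if __name__ == "__main__":
    stock = "股票".encode("gb2312")  # the search term "stock"
    print(hex_code_for_site(stock, "baidu"))
    print(hex_code_for_site(stock, "google"))
```

For the term "stock" this reproduces the two keyword codes quoted earlier in the text for the Baidu and google websites. In everyday Python code, `urllib.parse.quote` with its `encoding` argument performs a similar percent-encoding in one call, although it leaves unreserved ASCII characters unescaped.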
Step 1015: after the server returns the search result page for the dynamic page request formed from the obtained search term, the crawler seed obtaining device deletes the obtained search term from the memory space and returns to step 1012.
This embodiment mainly describes how a dynamic page request is constructed for each search term in an orderly manner. Therefore, after the dynamic page request is constructed in step 1014, the flow proceeds to step 1015, in which the processed search term is deleted from the memory space. This frees the memory the term occupied and also ensures that every search term is processed. In this process, if the search result page is not received from the server after the dynamic page request is sent, the request may be sent again several times. If the search result page still has not been received after these repeated attempts, the search term is recorded in a text file for later processing, deleted from the memory space, and the next search term is processed.
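The retry and fallback behaviour described above might look like the following sketch. The functions `fetch_with_retry` and `process_terms`, the `fetch` callable (which returns `None` when no page is received) and the default of three attempts are all hypothetical; the text itself only says that sending is attempted "several times".

```python
from typing import Callable, Dict, Iterable, Optional


def fetch_with_retry(url: str,
                     fetch: Callable[[str], Optional[str]],
                     max_attempts: int = 3) -> Optional[str]:
    """Send the request up to max_attempts times; return the first page received."""
    for _ in range(max_attempts):
        page = fetch(url)
        if page is not None:
            return page
    return None


def process_terms(terms: Iterable[str],
                  make_url: Callable[[str], str],
                  fetch: Callable[[str], Optional[str]],
                  failed_log_path: str,
                  max_attempts: int = 3) -> Dict[str, str]:
    """Process each term once; terms that never get a reply go to a text file."""
    pages = {}
    pending = list(terms)  # stands in for the terms loaded into memory
    while pending:
        term = pending[0]
        page = fetch_with_retry(make_url(term), fetch, max_attempts)
        if page is None:
            # Record the failed term for later processing.
            with open(failed_log_path, "a", encoding="utf-8") as log:
                log.write(term + "\n")
        else:
            pages[term] = page
        pending.pop(0)  # the term is deleted from memory in either case
    return pages
```

Deleting the term in both branches is what guarantees progress: a term that keeps failing cannot block the terms behind it.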
In the context of the overall crawler seed obtaining flow, step 1014 also proceeds to step 102 at the same time: each time a dynamic page request is constructed from a search term, it is sent to the server so that the server performs a site-wide deep search. If the crawler seed obtaining device actively polls the memory space at a certain frequency for remaining search terms and obtains one when present, then constructing the dynamic page request from the current search term runs in parallel with the server searching and returning the search result page for the dynamic page request of the previous search term. If the device checks the memory space for remaining search terms only after receiving the search result page for a search term, then it can construct a dynamic page request from the current search term only after the server has searched and returned the search result page for the dynamic page request of the previous search term.
Alternatively, the crawler seed obtaining device may first construct dynamic page requests for all search terms and then send all of them to the server together; in that case, the flow proceeds to step 102 after step 1016.
Step 1016: the crawler seed obtaining device ends the operation of constructing dynamic page requests.
In this embodiment, the crawler seed obtaining device loads all search terms in the search term dictionary into the memory space in advance, constructs a dynamic page request from each search term in turn, and deletes the corresponding search term after receiving the search result page returned by the server. On the one hand, this improves the efficiency of constructing dynamic page requests and lays the foundation for reducing the time needed to obtain crawler seeds. On the other hand, it releases memory promptly, which improves memory utilization and the processing speed of the crawler seed obtaining device.
On the basis of the above embodiment, this embodiment provides an implementation of step 103, comprising:
Step 1031: the crawler seed obtaining device receives the current search result page returned by the server, extracts the main domain name address of each URL in the current search result page as a target URL, and adds the obtained target URLs to the deduplication queue.
The search result page returned by the server is mainly a web page in Hypertext Markup Language (HTML) format. The crawler seed obtaining device first extracts external link addresses from the body part of the web page, keeping the URLs that conform to the URL characteristics. It then extracts the main domain name address of each URL as a target URL and adds the target URLs to the deduplication queue in preparation for deduplication. The deduplication queue is preferably a storage queue implemented in the memory space of the crawler seed obtaining device.
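As a rough illustration of step 1031, the sketch below collects links from the body of an HTML page with Python's standard `html.parser` and reduces each one to scheme plus host. Two points are assumptions rather than statements of the embodiment: "conforming to the URL characteristics" is approximated as "absolute http or https link", and the main domain name address is taken to be the scheme and host of the link. The names `BodyLinkParser` and `extract_target_urls` are hypothetical.

```python
from html.parser import HTMLParser
from urllib.parse import urlsplit


class BodyLinkParser(HTMLParser):
    """Collect href values of <a> tags that appear inside <body>."""

    def __init__(self):
        super().__init__()
        self.in_body = False
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "body":
            self.in_body = True
        elif tag == "a" and self.in_body:
            href = dict(attrs).get("href")
            if href:
                self.links.append(href)

    def handle_endtag(self, tag):
        if tag == "body":
            self.in_body = False


def extract_target_urls(html: str) -> list:
    """Return the main domain name address of every absolute link in the body."""
    parser = BodyLinkParser()
    parser.feed(html)
    targets = []
    for link in parser.links:
        parts = urlsplit(link)
        # Assumed URL characteristic: an absolute http(s) link with a host.
        if parts.scheme in ("http", "https") and parts.netloc:
            targets.append(f"{parts.scheme}://{parts.netloc}/")
    return targets
```

The returned list may contain repeats; in the flow described here, removing them is left to the deduplication queue.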
Step 1032: the crawler seed obtaining device judges, according to a preset search page threshold, whether the current search result page has a next page. If so, step 1033 is performed; if not, step 1034 is performed.
In this embodiment, the crawler seed obtaining device sets a search page threshold in advance for the URL characteristics of different websites and for different search terms, thereby limiting the number of search result pages the server needs to return for each search term. For example, when the search page threshold is set to 10, the server needs to return 10 search result pages.
Specifically, the crawler seed obtaining device may record the number of search result pages the server has returned and compare it with the search page threshold in real time; when the threshold is reached, the next search term is processed. If the device sends each dynamic page request to the server as soon as it is generated from a search term, processing the next search term means generating the next dynamic page request from the next search term and sending it to the server, so that the server performs a site-wide deep search and the subsequent processing follows. If the device sends the dynamic page requests for all search terms to the server at the same time, processing the next search term means extracting target URLs from the search result pages corresponding to the next search term.
Step 1033: the crawler seed obtaining device obtains a next-page request, sends it to the server, and returns to step 1031.
When the search result page for a search term has a next page, the crawler seed obtaining device obtains a next-page request and sends it to the server to obtain the next page of search results. After obtaining that page, the device performs step 1031 to extract target URLs from it, and repeats this until the current search result page has no next page.
In practice, the pages of some websites provide a next-page button. For a search result page with a next-page button, the crawler seed obtaining device may obtain the button from the current search result page and click it to send the next-page request to the server. For a search result page without a next-page button, the device may construct the dynamic URL of the next search result page according to the search term of the dynamic page request and the page rules of the current search result page (for example, the number of search results each page contains and the ID number, carried in the URL, of the first result displayed), and send that dynamic URL to the server to request the next page of search results.
Step 1034: the crawler seed obtaining device deletes the search term corresponding to the search result page from the memory space and ends this extraction of target URLs.
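Steps 1031 to 1034 amount to a bounded paging loop, which the following sketch illustrates under stated assumptions. The `fetch` and `next_page_url` callables, the helper `offset_next_page`, the offset parameter name `pn` and the default of 10 results per page are hypothetical; a real target website would define its own page rules.

```python
def collect_pages(first_url, fetch, next_page_url, page_threshold):
    """Fetch result pages until the threshold is reached or no next page exists."""
    pages = []
    url = first_url
    while url is not None and len(pages) < page_threshold:
        page = fetch(url)
        if page is None:
            break
        pages.append(page)
        # Ask for the next-page request; None means there is no next page.
        url = next_page_url(url, page, len(pages))
    return pages


def offset_next_page(base_url, per_page=10):
    """Build next-page URLs from a page rule: the offset of the first result."""

    def next_page_url(_current_url, _page, pages_so_far):
        # 'pn' is an assumed name for the offset parameter.
        return f"{base_url}&pn={pages_so_far * per_page}"

    return next_page_url
```

Passing the next-page logic in as a function keeps the loop the same for both cases in the text: one implementation can read a next-page link out of the page, another can compute the URL from the page rules.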
In this embodiment, storing the target URLs in the deduplication queue makes the subsequent deduplication convenient, and implementing the deduplication queue in the memory space of the crawler seed obtaining device can greatly increase the speed of deduplication, which helps reduce the time cost of obtaining crawler seeds.
Based on the above embodiment, this embodiment provides an implementation of step 104, comprising:
Step 1041: the crawler seed obtaining device judges whether any target URL remains in the deduplication queue. If so, step 1042 is performed; if not, step 1046 is performed.
In this embodiment, each time the crawler seed obtaining device finishes deduplicating a target URL, it deletes that target URL from the deduplication queue. The device can therefore tell whether all target URLs have been deduplicated by judging whether any target URL remains in the queue.
Step 1042: the crawler seed obtaining device obtains one target URL from the deduplication queue and performs step 1043.
When a target URL exists in the deduplication queue, the crawler seed obtaining device obtains one target URL from the queue.
Step 1043: the crawler seed obtaining device judges whether the obtained target URL already exists in the crawler seed list. If so, step 1045 is performed; if not, step 1044 is performed.
In this embodiment, the crawler seed obtaining device stores the deduplicated target URLs in the crawler seed list, which is preferably implemented in the external storage space of the device.
Specifically, after obtaining a target URL, the crawler seed obtaining device judges whether it already exists in the crawler seed list, that is, whether the list already contains a target URL identical to the obtained one.
Step 1044: the crawler seed obtaining device stores the obtained target URL in the crawler seed list and performs step 1045.
Step 1045: the obtained target URL is deleted from the deduplication queue, and the flow returns to step 1041.
When the obtained target URL does not exist in the crawler seed list, this target URL is appearing for the first time. The crawler seed obtaining device therefore stores it in the crawler seed list, deletes it from the deduplication queue, and continues with step 1041 and the subsequent operations.
When the obtained target URL already exists in the crawler seed list, this target URL has appeared before. The crawler seed obtaining device therefore does not store it in the crawler seed list again, deletes it directly from the deduplication queue, and continues with step 1041 and the subsequent operations.
Step 1046: the crawler seed obtaining device takes the target URLs stored in the crawler seed list as the crawler seeds.
When no target URL remains in the deduplication queue, all target URLs have been deduplicated. At this point, the crawler seed obtaining device takes the target URLs stored in the crawler seed list as the crawler seeds.
In this embodiment, the crawler seed obtaining device deduplicates the target URLs one by one, which is simple and easy to implement. Deleting each deduplicated target URL from the deduplication queue also releases the memory it occupied promptly and improves memory utilization.
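Steps 1041 to 1046 can be sketched as a simple queue-draining loop. This is an illustrative sketch only: the function name `deduplicate` is hypothetical, the crawler seed list is modelled as an in-memory Python list although the text prefers external storage, and the auxiliary `seen` set is an added convenience for fast membership checks that the embodiment does not mention.

```python
from collections import deque


def deduplicate(queue: deque, seed_list: list) -> list:
    """Drain the deduplication queue into seed_list, keeping each URL once."""
    seen = set(seed_list)             # assumed helper for fast lookups
    while queue:                      # step 1041: does a target URL remain?
        target = queue.popleft()      # steps 1042 and 1045: obtain and delete
        if target not in seen:        # step 1043: already in the seed list?
            seed_list.append(target)  # step 1044: store the first occurrence
            seen.add(target)
    return seed_list                  # step 1046: the list holds the seeds


queue = deque(["http://a.example/", "http://b.example/", "http://a.example/"])
print(deduplicate(queue, []))
```

Because `popleft()` removes each target URL as soon as it is examined, the queue shrinks steadily, which corresponds to the prompt release of memory noted above.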
Fig. 3 is a flowchart of the crawler seed obtaining method provided by another embodiment of the present invention. As shown in Fig. 3, the method of this embodiment comprises:
Step 301: the crawler seed obtaining device loads the preset search term dictionary into the memory space.
Step 302: the crawler seed obtaining device determines the target navigation website and the URL characteristics of the target navigation website.
In this step, the URL characteristics mainly refer to the format of the URL string.
Step 303: the crawler seed obtaining device judges whether a search term exists in the memory space. If so, step 304 is performed; otherwise, step 315 is performed.
Step 304: the crawler seed obtaining device obtains one search term and performs step 305.
Step 305: the crawler seed obtaining device judges whether the target navigation website is a preset special website. If so, step 306 is performed; if not, step 307 is performed.
Here, a preset special website mainly refers to a website that uses the UTF-8 encoding format, for example the Google website.
Step 306: the crawler seed obtaining device converts the search term from the GB2312 encoding format to the UTF-8 encoding format and performs step 307.
Step 307: the crawler seed obtaining device converts the binary value of the search term in its current encoding format (UTF-8 after step 306, GB2312 otherwise) into a hexadecimal value, and then performs step 308.
Step 308: the crawler seed obtaining device constructs a dynamic URL from the hexadecimal code of the search term and sends it to the server.
Step 309: the crawler seed obtaining device obtains the HTML page carrying the search results (that is, the search result page) returned by the server.
Step 310: the crawler seed obtaining device parses the HTML page and obtains the external link addresses that conform to the URL characteristics, that is, the URLs.
Step 311: the crawler seed obtaining device formats the obtained URLs, extracts the main domain name address of each URL, and sends the extracted main domain name addresses to the deduplication queue as target URLs.
Step 312: the crawler seed obtaining device judges, according to the preset search page threshold, whether the HTML page has a next page. If so, the flow returns to step 308; if not, step 313 is performed.
Step 313: the crawler seed obtaining device outputs the main domain name addresses in the deduplication queue to the local crawler seed list and empties the deduplication queue.
Step 314: the crawler seed obtaining device deletes the current search term from the memory space and returns to step 303.
Step 315: the crawler seed obtaining device deduplicates the main domain name addresses in the crawler seed list and ends the operation.
In this embodiment, the main domain name addresses are deduplicated after all of them have been delivered to the crawler seed list, but the method is not limited to this. For example, the crawler seed obtaining device may also deduplicate the main domain name addresses while outputting them from the deduplication queue to the crawler seed list.
Deduplication in this embodiment means that, of several identical main domain name addresses, only one is kept in the crawler seed list and the others are deleted.
In the crawler seed obtaining method of this embodiment, the crawler seed obtaining device constructs dynamic page requests according to the search term dictionary and the URL characteristics of the target navigation website and sends them to the server, actively extracts, from the search pages returned by the server, the main domain name addresses of the URLs that conform to the URL characteristics as target URLs, and deduplicates the target URLs to obtain the crawler seeds. Compared with the prior-art method of specifying crawler seeds manually, this embodiment can obtain a large number of widely dispersed crawler seeds and makes the obtaining of crawler seeds systematic. This reduces the time cost of obtaining mainstream URLs from the crawler seeds, improves the coverage of the mainstream URLs obtained, improves the efficiency of crawling based on the crawler seeds, and reduces the time cost of crawling.
Fig. 4 is a flowchart of the crawler crawling method provided by an embodiment of the present invention. As shown in Fig. 4, the method of this embodiment comprises:
Step 400: the crawler crawling device judges whether a crawler seed exists in the seed queue in its memory space. If so, step 401 is performed; if not, step 402 is performed.
The crawler seeds of this embodiment are those obtained by the crawler seed obtaining method provided by the above embodiments. The crawler seeds are loaded into the memory space of the crawler crawling device and provide the crawler with the initial URLs to crawl.
In this embodiment, loading the crawler seeds into the seed queue in the memory space speeds up obtaining crawler seeds during crawling, which improves crawling efficiency and helps save time.
Step 401: the crawler crawling device obtains a crawler seed from the seed queue and crawls it, adds the URLs in the page crawled from that crawler seed to the to-be-crawled queue in the memory space, and deletes the crawled crawler seed from the seed queue.
When a crawler seed exists in the seed queue in the memory space, the crawler crawling device obtains one crawler seed and crawls with that seed as the initial URL. When a page is crawled, the crawler crawling device parses it and obtains the URLs of its external links; each URL obtained is added to the to-be-crawled queue in the memory space. To ensure that crawler seeds are crawled first, this embodiment keeps both a seed queue and a to-be-crawled queue in the memory space: the seed queue stores the crawler seeds, and the to-be-crawled queue stores the URLs obtained during crawling. Only after all crawler seeds in the seed queue have been crawled does the device start crawling the URLs in the to-be-crawled queue. Meanwhile, the crawler crawling device deletes each crawled crawler seed from the seed queue, which prevents the same crawler seed from being crawled repeatedly and releases memory promptly, improving memory utilization.
In addition, the crawler crawling device sends each crawled page to a subsequent information extractor, which extracts the relevant information from the page.
Step 402: the crawler crawling device judges whether a URL to be crawled exists in the to-be-crawled queue. If so, step 403 is performed; if not, step 404 is performed.
When no crawler seed exists in the seed queue, the crawler crawling device judges whether the to-be-crawled queue contains a URL to be crawled. If it does, the device continues by crawling the URLs in the to-be-crawled queue; if it does not, the crawling operation ends.
Step 403: the crawler crawling device obtains a URL to be crawled from the to-be-crawled queue and crawls it, adds the URLs in the page crawled from that URL to the to-be-crawled queue, and deletes the crawled URL from the to-be-crawled queue.
When no crawler seed exists in the seed queue but the to-be-crawled queue contains URLs to be crawled, the crawler crawling device obtains one URL from the to-be-crawled queue and crawls it, analyzes the crawled page, obtains the URLs of the external links in that page, and adds them to the to-be-crawled queue so that crawling can continue. Meanwhile, the crawler crawling device deletes the crawled URL from the to-be-crawled queue, which prevents the same URL from being crawled repeatedly and releases memory promptly, providing storage space for newly crawled URLs and improving memory utilization.
Likewise, the crawler crawling device sends each crawled page to a subsequent information extractor, which extracts the relevant information from the page.
Step 404: the crawling operation ends.
When no crawler seed exists in the seed queue and no URL to be crawled exists in the to-be-crawled queue, there is nothing left to crawl, so the crawling process ends.
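The two-queue priority scheme of steps 400 to 404 can be illustrated by the following sketch. The names are hypothetical: `crawl`, the `fetch_links` callable (which returns the external-link URLs of a crawled page) and the `max_pages` limit. The limit is an assumption added here so that the sketch always terminates; the embodiment does not describe such a limit.

```python
from collections import deque


def crawl(seeds, fetch_links, max_pages=100):
    """Crawl seeds first, then URLs discovered along the way."""
    seed_queue = deque(seeds)
    pending_queue = deque()            # the to-be-crawled queue
    crawled = []
    while len(crawled) < max_pages:
        if seed_queue:                 # step 400 -> step 401
            url = seed_queue.popleft()
        elif pending_queue:            # step 402 -> step 403
            url = pending_queue.popleft()
        else:                          # step 404: nothing left to crawl
            break
        crawled.append(url)
        # URLs found in the crawled page always join the to-be-crawled queue.
        pending_queue.extend(fetch_links(url))
    return crawled


links = {"s1": ["a"], "s2": ["b"], "b": ["c"]}
print(crawl(["s1", "s2"], lambda url: links.get(url, [])))
```

Checking the seed queue first on every iteration is what gives seeds priority: even if new seeds are appended while crawling is under way, they are taken before any URL in the to-be-crawled queue.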
The crawler crawling method of this embodiment uses the crawler seeds obtained by the above embodiments as the initial URLs to crawl and crawls the crawler seeds first; other URLs are crawled only after all crawler seeds have been crawled. Crawling the crawler seeds first optimizes the operating efficiency of the search engine's crawler, so that the crawling of massive data can be completed purposefully in a relatively short time. For rapid deployment of, for example, a vertical search engine, this can greatly shorten the web crawler's data acquisition time and improve the convenience and practicality of system deployment.
In the above embodiment, before step 400 the crawler crawling device may load all crawler seeds in the crawler seed list into the seed queue in the memory space in advance, laying the groundwork for step 400 and the subsequent steps. In this implementation, the crawler seed obtaining process is completed before crawling begins.
Alternatively, the crawler seed obtaining process may run concurrently with the crawling process. While crawler seeds are being obtained, the crawler crawling device periodically accesses the crawler seed list, loads the crawler seeds that have not yet been loaded into the seed queue in its memory space, and sets an access flag on each crawler seed in the list that has been loaded into the seed queue. Crawling can thus proceed while crawler seeds are still being obtained, and crawler seeds are still crawled first. In addition, the access flags identify the crawler seeds that have been loaded and prevent the same crawler seed from being loaded repeatedly, which lays the foundation for improving crawling efficiency.
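The periodic loading with access flags could be sketched as follows. The crawler seed list is modelled as a dictionary that maps each seed URL to its access flag; this representation and the function name `load_new_seeds` are assumptions made for illustration. In practice the function would be called on a timer while seed obtaining continues.

```python
from collections import deque


def load_new_seeds(seed_store: dict, seed_queue: deque) -> int:
    """Load seeds without an access flag into the seed queue and flag them."""
    loaded = 0
    for url, accessed in seed_store.items():
        if not accessed:
            seed_queue.append(url)
            seed_store[url] = True  # set the access flag
            loaded += 1
    return loaded
```

A second call immediately after the first loads nothing, because every seed already carries its flag; only seeds added to the store in the meantime are picked up.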
Fig. 5A is a flowchart of the crawler crawling method provided by another embodiment of the present invention; Fig. 5B is a schematic diagram of the crawling topology provided by another embodiment of the present invention. The crawling topology shown in Fig. 5B differs from the topology shown in Fig. 1 in that a crawler seed obtaining device and a crawler seed server that stores crawler seeds are placed in front of the crawler server. In addition, the crawler server in the topology shown in Fig. 5B is the crawler crawling device of the embodiment shown in Fig. 4 and has the function of crawling crawler seeds first. As shown in Fig. 5A, the method of this embodiment comprises:
Step 501: the crawler seed obtaining device obtains crawler seeds and stores them in the crawler seed server.
The crawler seed server in this step is functionally equivalent to the crawler seed list in the above embodiments.
For the specific implementation of step 501, see the detailed description of the embodiment shown in Fig. 2 or Fig. 3.
Step 502: the crawler server periodically extracts from the crawler seed server the crawler seeds that have no access flag, sets an access flag on each extracted crawler seed, and inserts the extracted crawler seeds into the seed queue in its memory space.
Step 503: the crawler server crawls the crawler seeds in the seed queue first; when no crawler seed remains in the seed queue, it crawls the external link URLs extracted from the crawled HTML pages.
In the crawler crawling method of this embodiment, the crawler seed obtaining device constructs dynamic URLs for the target navigation website and thereby mines URLs deep within that website, so as to obtain a large number of dispersed, optimized crawler seeds. The crawler server then crawls the crawler seeds first, which optimizes the operating efficiency of the search engine's crawler so that the crawling of massive data can be completed purposefully in a relatively short time. Especially for rapid deployment of, for example, a vertical search engine, this can greatly shorten the web crawler's data acquisition time and improve the convenience and practicality of system deployment.
Fig. 6 is a schematic structural diagram of the crawler seed obtaining device provided by an embodiment of the present invention. As shown in Fig. 6, the crawler seed obtaining device of this embodiment comprises: a construction module 61, a sending module 62, an extraction module 63 and an acquisition module 64.
The construction module 61 is configured to construct a dynamic page request according to the preset search term dictionary and the URL characteristics of the target navigation website. The sending module 62 is connected to the construction module 61 and to the server, and is configured to send the dynamic page request to the server of the target navigation website. The extraction module 63 is connected to the server and is configured to extract, according to a preset extraction policy, target URLs from the search result page that the server returns in response to the dynamic page request, where each target URL is the main domain name address of a URL in the search result page. The acquisition module 64 is connected to the extraction module 63 and is configured to deduplicate the target URLs to obtain deduplicated target URLs, which are used as the crawler seeds.
Each functional module of the crawler seed obtaining device of this embodiment can be used to execute the flow of the crawler seed obtaining method shown in Fig. 2. The specific working principles are not repeated here; see the description of the method embodiment.
The crawler seed obtaining device of this embodiment constructs dynamic page requests according to the search term dictionary and the URL characteristics of the target navigation website and sends them to the server, actively extracts, from the search pages returned by the server, the main domain name addresses of the URLs that conform to the URL characteristics as target URLs, and deduplicates the target URLs to obtain the crawler seeds. Compared with the prior-art method of specifying crawler seeds manually, the device of this embodiment can obtain a large number of widely dispersed crawler seeds and makes the obtaining of crawler seeds systematic. As a result, when mainstream URLs are obtained by crawling from the crawler seeds, the efficiency of obtaining them improves, the time cost falls, and the coverage of the mainstream URLs obtained increases. This also lays the foundation for faster and more efficient crawling based on the crawler seeds and for a lower crawling time cost.
Fig. 7 is a schematic structural diagram of the crawler seed obtaining device provided by another embodiment of the present invention. This embodiment is implemented on the basis of the embodiment shown in Fig. 6. As shown in Fig. 7, the construction module 61 of the crawler seed obtaining device of this embodiment comprises: a loading unit 611, a search term obtaining unit 612 and a construction unit 613.
In a specific implementation, the loading unit 611 loads all search terms in the search term dictionary into the memory space of the crawler seed obtaining device. The search term obtaining unit 612 judges whether a search term exists in the memory space and, when one exists, obtains one search term. The construction unit 613 constructs a dynamic URL, according to the URL characteristics of the target navigation website, from the hexadecimal code corresponding to the search term obtained by the search term obtaining unit 612, so as to form a dynamic page request.
Further, the construction unit 613 of this embodiment is specifically configured to judge whether the target navigation website is a preset special website, where a special website mainly refers to a website that uses the UTF-8 encoding format, such as the Google website. When the target navigation website is a special website, the construction unit 613 converts the GB2312 code of the obtained search term into a UTF-8 code, converts the binary value of the UTF-8 code into a hexadecimal value, and uses the resulting hexadecimal code to construct a dynamic URL that conforms to the URL characteristics of the special website, so as to form the dynamic page request. When the target navigation website is not a special website, the construction unit 613 converts the binary value of the GB2312 code of the obtained search term into a hexadecimal value and uses the resulting hexadecimal code to construct a dynamic URL that conforms to the URL characteristics of the target navigation website, so as to form the dynamic page request.
The above functional units can be used to execute the flow of the implementation of step 101 in the above method embodiment. The specific working principles are not repeated here; see the description of the method embodiment.
The sending module 62 of this embodiment comprises either of the following sending units or a combination of them. The first sending unit 621 is configured to send an HTTP request to the server, the HTTP request including the dynamic page request formed by the construction unit 613. The second sending unit 622 is configured to send a URL access request to the server, the URL access request including the dynamic page request formed by the construction unit 613.
The above functional units can be used to execute the flow of the implementation of step 102 in the above method embodiment. The specific working principles are not repeated here; see the description of the method embodiment.
The extraction module 63 of the present embodiment comprises: receive acquiring unit 631, judgement trigger element 632 and delete cells 633.
In specific implementation process, receiving acquiring unit 631 is connected with server, the current result for retrieval page returning for reception server, extracts the Main Domain address of the URL in the current result for retrieval page as target URL, and the target URL obtaining is added in uniqueization queue.Judgement trigger element 632 is connected with server with reception acquiring unit 631, for whether also have lower one page according to the default current result for retrieval page of searching page threshold decision, and when the current result for retrieval page also has lower one page, obtain lower one page page request, and lower one page page request is sent to server, and trigger reception acquiring unit 631 and carry out the current result for retrieval page that reception servers return, extract the Main Domain address of the URL in the current result for retrieval page as target URL, and target URL is added to the operation in uniqueization queue, until there is not lower one page in the current result for retrieval page.Delete cells 633, be connected with judgement trigger element 632, for at judgement trigger element 632, judge the current result for retrieval page not in the presence of during one page, the term corresponding with retrieval result page face deleted from memory headroom, for term acquiring unit 612 obtains term, provide condition.
More specifically, the judging and triggering unit 632 may locate the next-page button in the current search result page and click it to send the next-page request to the server. It then triggers the receiving and acquiring unit 631 to repeat the operations of receiving the current search result page returned by the server, extracting the main domain name address of each URL in that page as a target URL, and adding the target URL to the deduplication queue, until the current search result page has no next page.
Alternatively, the judging and triggering unit 632 may construct the dynamic URL of the next search result page according to the search term corresponding to the dynamic page request and the paging rule of the current search result page, and send that dynamic URL to the server. It then triggers the receiving and acquiring unit 631 to repeat the operations of receiving the current search result page returned by the server, extracting the main domain name address of each URL in that page as a target URL, and adding the target URL to the deduplication queue, until the current search result page has no next page.
The above functional units may be used to carry out the flow of the implementation of step 103 in the foregoing method embodiment. Their specific working principles are not repeated here; refer to the description of the method embodiment.
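The paging loop of the extraction module can be illustrated with a minimal Python sketch. It rests on several assumptions that are not stated in this document: the main domain name address is taken to be the scheme plus host of a link, the preset search page threshold is read as a maximum number of result pages, and `fetch_page` is a hypothetical callable standing in for whichever next-page mechanism (button click or constructed dynamic URL) is used.

```python
from html.parser import HTMLParser
from urllib.parse import urlsplit


class LinkCollector(HTMLParser):
    """Collects the href value of every <a> tag in a page."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.links.append(href)


def main_domain_address(url):
    """Reduce an absolute URL to scheme://host (assumed meaning of
    'main domain name address'); return None for relative links."""
    parts = urlsplit(url)
    if parts.scheme in ("http", "https") and parts.netloc:
        return f"{parts.scheme}://{parts.netloc}"
    return None


def extract_for_term(term, fetch_page, max_pages, dedup_queue):
    """Walk the result pages for one search term and queue target URLs.

    fetch_page(term, page_no) is a hypothetical callable returning the HTML
    of result page page_no, or None when that page does not exist.
    max_pages plays the role of the preset search page threshold.
    """
    page_no = 1
    while page_no <= max_pages:
        html = fetch_page(term, page_no)
        if html is None:          # the current page has no next page
            break
        collector = LinkCollector()
        collector.feed(html)
        for link in collector.links:
            target = main_domain_address(link)
            if target is not None:
                dedup_queue.append(target)
        page_no += 1
```

Duplicates are deliberately left in the queue here, since removing them is the job of the acquisition module described next.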
The acquisition module 64 of this embodiment comprises an extraction unit 641, a judging unit 642, a storage judging unit 643, and a seed acquiring unit 644.
The extraction unit 641 is connected to the receiving and acquiring unit 631 and is configured to obtain one target URL from the deduplication queue when a target URL exists in the deduplication queue.
The judging unit 642 is connected to the extraction unit 641. When the target URL obtained by the extraction unit 641 already exists in the crawler seed list, it deletes the obtained target URL from the deduplication queue and judges whether a target URL still exists in the deduplication queue. If one does, it triggers the extraction unit 641 to perform the extraction operation; if not, it triggers the seed acquiring unit 644 to perform the acquiring operation.
The storage judging unit 643 is connected to the extraction unit 641. When the target URL obtained by the extraction unit 641 does not exist in the crawler seed list, it stores the obtained target URL in the crawler seed list, deletes the obtained target URL from the deduplication queue, and judges whether a target URL still exists in the deduplication queue. If one does, it triggers the extraction unit 641 to perform the extraction operation; if not, it triggers the seed acquiring unit 644 to perform the acquiring operation.
The seed acquiring unit 644 is connected to the judging unit 642 and to the storage judging unit 643 and is configured to use the target URLs stored in the crawler seed list as crawler seeds when no target URL exists in the deduplication queue.
The above functional units may be used to carry out the flow of the implementation of step 104 in the foregoing method embodiment. Their specific working principles are not repeated here; refer to the description of the method embodiment.
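The cooperation of the four units of the acquisition module amounts to draining a queue into a list without duplicates. The following Python sketch is one possible reading; the function name and the use of an auxiliary set for membership tests are illustrative choices, not details taken from this document.

```python
from collections import deque


def deduplicate(dedup_queue, seed_list):
    """Drain dedup_queue and append each previously unseen URL to seed_list."""
    seen = set(seed_list)              # fast membership test for the seed list
    while dedup_queue:                 # extraction unit: a target URL exists
        url = dedup_queue.popleft()    # the URL leaves the queue in both cases
        if url not in seen:            # storage judging unit: new target URL
            seen.add(url)
            seed_list.append(url)
        # otherwise (judging unit): already listed, nothing is stored
    return seed_list                   # seed acquiring unit: list holds the seeds
```

For example, draining a queue holding `http://a`, `http://b`, `http://a` into a list that already contains `http://b` leaves the list as `http://b`, `http://a` and the queue empty.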
The crawler seed obtaining device of this embodiment constructs a dynamic page request according to the search term dictionary and the URL characteristics of the target navigation website and sends it to the server, actively extracts from the search result pages returned by the server the main domain name addresses of the URLs that meet the URL characteristics as target URLs, and performs deduplication processing on the target URLs to obtain crawler seeds. Compared with the prior-art method of manually specifying crawler seeds, the device of this embodiment can obtain a large number of widely dispersed crawler seeds and makes the obtaining of crawler seeds systematic. As a result, when crawling from these seeds to obtain mainstream URLs, the efficiency of obtaining mainstream URLs improves, the time cost of obtaining them falls, and the coverage of the obtained mainstream URLs increases. It also lays a foundation for improving the speed and efficiency of crawling operations based on the crawler seeds and for reducing crawling time cost.
Fig. 8 is a schematic structural diagram of a crawler crawling device provided by an embodiment of the present invention. As shown in Fig. 8, the crawler crawling device of this embodiment comprises a first crawling module 81 and a second crawling module 82.
The first crawling module 81 is connected to the memory space of the crawler crawling device and is configured to, when a crawler seed exists in the seed queue in the memory space, obtain one crawler seed from the seed queue and crawl it, add the URLs in the page crawled from that crawler seed to the to-be-crawled queue in the memory space, and delete the crawled crawler seed from the seed queue. The second crawling module 82 is connected to the memory space of the crawler crawling device and is configured to, when no crawler seed exists in the seed queue and a URL to be crawled exists in the to-be-crawled queue, obtain one URL to be crawled from the to-be-crawled queue and crawl it, add the URLs in the page crawled from the obtained URL to the to-be-crawled queue, and delete the crawled URL from the to-be-crawled queue.
The crawler seeds of this embodiment are obtained by the crawler seed obtaining device of Fig. 6 or Fig. 7. For the specific obtaining flow, refer to the crawler seed obtaining method shown in Fig. 2 or Fig. 3; it is not described in detail in this embodiment.
The functional modules of the crawler crawling device of this embodiment may be used to carry out the corresponding flow of the crawler crawling method shown in Fig. 4. Their specific working principles are not repeated here; refer to the description of the method embodiment.
The crawler crawling device of this embodiment is combined with the crawler seed obtaining device provided by the embodiments of the present invention and performs crawling operations using the crawler seeds that device obtains. It takes full advantage of the large volume and wide dispersion of the crawler seeds and, by crawling the crawler seeds first, optimizes the running efficiency of the search engine's crawler, so that the crawling of massive data can be completed purposefully in a short time. Especially for the rapid deployment of systems such as vertical search engines, it can greatly shorten the time the web crawler needs to gather data, improving the convenience and practicality of system deployment.
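The priority that the two crawling modules give to the seed queue over the to-be-crawled queue can be sketched in Python as follows. This is an illustration under assumptions: `fetch_links` is a hypothetical callable that returns the URLs found on a page, and `max_fetches` is a safety bound added for the sketch, since the description above does not say how already-crawled URLs are kept from being queued again.

```python
from collections import deque


def crawl(seed_queue, fetch_links, max_fetches):
    """Crawl seeds first, then the URLs discovered along the way.

    seed_queue is a deque of crawler seeds. fetch_links(url) is a
    hypothetical callable returning the URLs found on the page at url.
    max_fetches bounds the run and is not part of the described method.
    """
    to_crawl = deque()                 # to-be-crawled queue
    visited = []
    while len(visited) < max_fetches:
        if seed_queue:                 # first crawling module: a seed exists
            url = seed_queue.popleft()
        elif to_crawl:                 # second crawling module: no seed left
            url = to_crawl.popleft()
        else:
            break                      # both queues are empty
        visited.append(url)
        to_crawl.extend(fetch_links(url))
    return visited
```

Because the seed queue is checked on every iteration, seeds loaded while the crawl is running would still be served before any discovered URL.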
Fig. 9 is a schematic structural diagram of a crawler crawling device provided by another embodiment of the present invention. This embodiment is implemented on the basis of the embodiment shown in Fig. 8. As shown in Fig. 9, the crawler crawling device of this embodiment further comprises either of the following loading modules, or both:
The first loading module 83 is connected to the memory space of the crawler crawling device and is configured to load all crawler seeds in the crawler seed list into the seed queue in the memory space. The second loading module 84 is connected to the memory space of the crawler crawling device and is configured to access the crawler seed list periodically, load the crawler seeds in the crawler seed list that have not yet been loaded into the seed queue in the memory space, and set an access flag in the crawler seed list for each crawler seed that has been loaded into the seed queue.
The above functional modules may be used to carry out the corresponding flows of the crawler crawling method shown in Fig. 4 or Fig. 5A. They load the crawler seeds obtained by the crawler seed obtaining device into the memory space of the crawler crawling device, laying the groundwork for the first crawling module and the second crawling module. Their specific working principles are not repeated here; refer to the description of the method embodiment.
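The incremental loading performed by the second loading module can be sketched in Python as below. The representation of the crawler seed list as a list of dictionaries with `url` and `loaded` keys is an assumption made for illustration, with `loaded` standing in for the access flag; the periodic scheduling itself is left to the caller.

```python
def load_new_seeds(seed_list, seed_queue):
    """Load every seed not yet flagged as loaded into seed_queue.

    seed_list is assumed to be a list of dicts such as
    {"url": "http://a.example.com", "loaded": False}.
    Returns the number of seeds loaded by this call.
    """
    added = 0
    for entry in seed_list:
        if not entry["loaded"]:
            seed_queue.append(entry["url"])
            entry["loaded"] = True     # set the access flag in the seed list
            added += 1
    return added
```

Calling the function a second time without new entries in the seed list loads nothing, which is the effect the access flag is meant to achieve.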
The crawler crawling device of this embodiment is combined with the crawler seed obtaining device provided by the embodiments of the present invention and performs crawling operations using the crawler seeds that device obtains. It takes full advantage of the large volume and wide dispersion of the crawler seeds and, by crawling the crawler seeds first, optimizes the running efficiency of the search engine's crawler, so that the crawling of massive data can be completed purposefully in a short time. Especially for the rapid deployment of systems such as vertical search engines, it can greatly shorten the time the web crawler needs to gather data, improving the convenience and practicality of system deployment.
Those of ordinary skill in the art will understand that all or part of the steps of the foregoing method embodiments may be completed by hardware under the control of program instructions. The program may be stored in a computer-readable storage medium, and when executed it performs the steps of the foregoing method embodiments. The storage medium includes various media capable of storing program code, such as ROM, RAM, magnetic disks, or optical discs.
Finally, it should be noted that the above embodiments are only intended to illustrate the technical solutions of the present invention, not to limit them. Although the present invention has been described in detail with reference to the foregoing embodiments, those of ordinary skill in the art should understand that the technical solutions recorded in the foregoing embodiments may still be modified, or some of their technical features may be equivalently replaced, and that such modifications or replacements do not cause the essence of the corresponding technical solutions to depart from the spirit and scope of the technical solutions of the embodiments of the present invention.

Claims (14)

1. A crawler seed obtaining method, characterized by comprising:
Constructing a dynamic page request according to a preset search term dictionary and the uniform resource locator (URL) characteristics of a target navigation website;
Sending the dynamic page request to a server of the target navigation website;
Extracting, according to a preset extraction policy, a target URL from the search result page returned by the server in response to the dynamic page request, the target URL being the main domain name address of a URL in the search result page;
Performing deduplication processing on the target URL to obtain a deduplicated target URL, and using the deduplicated target URL as a crawler seed;
Wherein constructing the dynamic page request according to the preset search term dictionary and the URL characteristics of the target navigation website comprises:
Loading all search terms in the search term dictionary into a memory space;
When a search term exists in the memory space, obtaining one search term;
Constructing, according to the URL characteristics of the target navigation website, a dynamic URL using the hexadecimal code corresponding to the obtained search term, to form the dynamic page request;
Wherein extracting, according to the preset extraction policy, the target URL from the search result page returned by the server in response to the dynamic page request comprises:
Receiving the current search result page returned by the server, extracting the main domain name address of a URL in the current search result page as the target URL, and adding the target URL to a deduplication queue;
When it is judged according to a preset search page threshold that the current search result page has a next page, obtaining a next-page request, sending the next-page request to the server, and continuing to perform the operations of receiving the current search result page returned by the server, extracting the main domain name address of a URL in the current search result page as the target URL, and adding the target URL to the deduplication queue, until the current search result page has no next page, and then deleting the search term corresponding to the search result pages from the memory space.
2. The crawler seed obtaining method according to claim 1, characterized in that constructing, according to the URL characteristics of the target navigation website, the dynamic URL using the hexadecimal code corresponding to the obtained search term, to form the dynamic page request, comprises:
Judging whether the target navigation website is a preset special website;
If the judgment result is yes, converting the GB2312 code corresponding to the obtained search term into a UTF-8 code, converting the binary value of the UTF-8 code into a hexadecimal value, and constructing, with the converted hexadecimal code, a dynamic URL that meets the URL characteristics of the special website, to form the dynamic page request;
If the judgment result is no, converting the binary value of the GB2312 code corresponding to the obtained search term into a hexadecimal value, and constructing, with the converted hexadecimal code, a dynamic URL that meets the URL characteristics of the target navigation website, to form the dynamic page request.
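For illustration only, and not as part of the claims, the two encoding branches of claim 2 can be sketched in Python. The sketch assumes that the hexadecimal code corresponds to the percent-encoded bytes of the search term, and the URL template with its `{term}` placeholder is hypothetical. Python strings are already Unicode, so the GB2312-to-UTF-8 conversion reduces to choosing the byte encoding.

```python
from urllib.parse import quote


def build_dynamic_url(url_template, term, is_special_site):
    """Fill a URL template with the hex-encoded search term.

    url_template is a hypothetical template containing a {term} placeholder.
    Special websites receive UTF-8 bytes, all others GB2312 bytes.
    """
    encoding = "utf-8" if is_special_site else "gb2312"
    # quote() writes each encoded byte as %XX with uppercase hex digits.
    return url_template.format(term=quote(term, safe="", encoding=encoding))
```

With the template `http://nav.example.com/s?q={term}` and the term `中`, the UTF-8 branch yields `...q=%E4%B8%AD` and the GB2312 branch yields `...q=%D6%D0`.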
3. reptile seed acquisition methods according to claim 1 and 2, is characterized in that, describedly sends to the server of described target navigation website to comprise described dynamic page request:
To described server, send HTML (Hypertext Markup Language) HTTP request, described HTTP request comprises described dynamic page request; Or
To described server, send URL access request, described URL access request comprises described dynamic page request.
4. The crawler seed obtaining method according to claim 1, characterized in that obtaining the next-page request and sending the next-page request to the server comprises:
Obtaining a next-page button from the current search result page, and clicking the next-page button to send the next-page request to the server; or
Constructing the dynamic URL corresponding to the next search result page according to the search term corresponding to the dynamic page request and the paging rule of the current search result page, and sending the dynamic URL corresponding to the next search result page to the server.
5. The crawler seed obtaining method according to claim 1, characterized in that performing deduplication processing on the target URL to obtain the deduplicated target URL, and using the deduplicated target URL as the crawler seed, comprises:
When a target URL still exists in the deduplication queue, obtaining one target URL from the deduplication queue;
When the obtained target URL already exists in a crawler seed list, deleting the obtained target URL from the deduplication queue, and judging whether a target URL still exists in the deduplication queue;
When the obtained target URL does not exist in the crawler seed list, storing the obtained target URL in the crawler seed list, deleting the obtained target URL from the deduplication queue, and judging whether a target URL still exists in the deduplication queue;
When no target URL exists in the deduplication queue, using the target URLs stored in the crawler seed list as the crawler seeds.
6. A crawler crawling method using the crawler seed obtained by the crawler seed obtaining method according to any one of claims 1-5, characterized by comprising:
When the crawler seed exists in a seed queue in a memory space, obtaining one crawler seed from the seed queue and crawling it, adding the URLs in the page crawled from the crawler seed to a to-be-crawled queue in the memory space, and deleting the crawled crawler seed from the seed queue;
When the crawler seed does not exist in the seed queue and a URL to be crawled exists in the to-be-crawled queue, obtaining one URL to be crawled from the to-be-crawled queue and crawling it, adding the URLs in the page crawled from the obtained URL to the to-be-crawled queue, and deleting the crawled URL from the to-be-crawled queue.
7. The crawler crawling method according to claim 6, characterized in that, before obtaining one crawler seed from the seed queue and crawling it when the crawler seed exists in the seed queue in the memory space, the method comprises:
Loading all crawler seeds in the crawler seed list into the seed queue in the memory space; or
Periodically accessing the crawler seed list, loading the crawler seeds in the crawler seed list that have not yet been loaded into the seed queue in the memory space, and setting an access flag in the crawler seed list for each crawler seed that has been loaded into the seed queue.
8. A crawler seed obtaining device, characterized by comprising:
A construction module, configured to construct a dynamic page request according to a preset search term dictionary and the uniform resource locator (URL) characteristics of a target navigation website;
A sending module, configured to send the dynamic page request to a server of the target navigation website;
An extraction module, configured to extract, according to a preset extraction policy, a target URL from the search result page returned by the server in response to the dynamic page request, the target URL being the main domain name address of a URL in the search result page;
An acquisition module, configured to perform deduplication processing on the target URL to obtain a deduplicated target URL, and to use the deduplicated target URL as a crawler seed;
Wherein the construction module comprises:
A loading unit, configured to load all search terms in the search term dictionary into a memory space;
A search term acquiring unit, configured to obtain one search term when a search term exists in the memory space;
A construction unit, configured to construct, according to the URL characteristics of the target navigation website, a dynamic URL using the hexadecimal code corresponding to the obtained search term, to form the dynamic page request;
Wherein the extraction module comprises:
A receiving and acquiring unit, configured to receive the current search result page returned by the server, extract the main domain name address of a URL in the current search result page as the target URL, and add the target URL to a deduplication queue;
A judging and triggering unit, configured to, when it is judged according to a preset search page threshold that the current search result page has a next page, obtain a next-page request, send the next-page request to the server, and trigger the receiving and acquiring unit to perform the operations of receiving the current search result page returned by the server, extracting the main domain name address of a URL in the current search result page as the target URL, and adding the target URL to the deduplication queue, until the current search result page has no next page;
A deleting unit, configured to delete the search term corresponding to the search result pages from the memory space when the current search result page has no next page.
9. The crawler seed obtaining device according to claim 8, characterized in that the construction unit is specifically configured to judge whether the target navigation website is a preset special website; when the judgment result is yes, to convert the GB2312 code corresponding to the obtained search term into a UTF-8 code, convert the binary value of the UTF-8 code into a hexadecimal value, and construct, with the converted hexadecimal code, a dynamic URL that meets the URL characteristics of the special website, to form the dynamic page request; and when the judgment result is no, to convert the binary value of the GB2312 code corresponding to the obtained search term into a hexadecimal value, and construct, with the converted hexadecimal code, a dynamic URL that meets the URL characteristics of the target navigation website, to form the dynamic page request.
10. The crawler seed obtaining device according to claim 8 or 9, characterized in that the sending module comprises:
A first sending unit, configured to send a Hypertext Transfer Protocol (HTTP) request to the server, the HTTP request comprising the dynamic page request; and/or
A second sending unit, configured to send a URL access request to the server, the URL access request comprising the dynamic page request.
11. The crawler seed obtaining device according to claim 8, characterized in that the judging and triggering unit is specifically configured to obtain a next-page button from the current search result page, click the next-page button to send the next-page request to the server, and then trigger the receiving and acquiring unit to perform the operations of receiving the search result page returned by the server, extracting the main domain name address of a URL in the search result page as the target URL, and adding the target URL to the deduplication queue, until the current search result page has no next page; or is specifically configured to construct the dynamic URL corresponding to the next search result page according to the search term corresponding to the dynamic page request and the paging rule of the current search result page, send the dynamic URL corresponding to the next search result page to the server, and then trigger the receiving and acquiring unit to perform the operations of receiving the search result page returned by the server, extracting the main domain name address of a URL in the search result page as the target URL, and adding the target URL to the deduplication queue, until the current search result page has no next page.
12. The crawler seed obtaining device according to claim 8, characterized in that the acquisition module comprises:
An extraction unit, configured to obtain one target URL from the deduplication queue when a target URL exists in the deduplication queue;
A judging unit, configured to, when the obtained target URL already exists in a crawler seed list, delete the obtained target URL from the deduplication queue and judge whether a target URL still exists in the deduplication queue;
A storage judging unit, configured to, when the obtained target URL does not exist in the crawler seed list, store the obtained target URL in the crawler seed list, delete the obtained target URL from the deduplication queue, and judge whether a target URL still exists in the deduplication queue;
A seed acquiring unit, configured to use the target URLs stored in the crawler seed list as the crawler seeds when no target URL exists in the deduplication queue.
13. A crawler crawling device using the crawler seed obtained by the crawler seed obtaining method according to any one of claims 1-5, characterized by comprising:
A first crawling module, configured to, when the crawler seed exists in a seed queue in a memory space, obtain one crawler seed from the seed queue and crawl it, add the URLs in the page crawled from the crawler seed to a to-be-crawled queue in the memory space, and delete the crawled crawler seed from the seed queue;
A second crawling module, configured to, when the crawler seed does not exist in the seed queue and a URL to be crawled exists in the to-be-crawled queue, obtain one URL to be crawled from the to-be-crawled queue and crawl it, add the URLs in the page crawled from the obtained URL to the to-be-crawled queue, and delete the crawled URL from the to-be-crawled queue.
14. The crawler crawling device according to claim 13, characterized by further comprising:
A first loading module, configured to load all crawler seeds in the crawler seed list into the seed queue in the memory space; and/or
A second loading module, configured to periodically access the crawler seed list, load the crawler seeds in the crawler seed list that have not yet been loaded into the seed queue in the memory space, and set an access flag in the crawler seed list for each crawler seed that has been loaded into the seed queue.
CN201110232595.XA 2011-08-15 2011-08-15 Crawler seed obtaining method and equipment and crawler crawling method and equipment Expired - Fee Related CN102355488B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201110232595.XA CN102355488B (en) 2011-08-15 2011-08-15 Crawler seed obtaining method and equipment and crawler crawling method and equipment

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201110232595.XA CN102355488B (en) 2011-08-15 2011-08-15 Crawler seed obtaining method and equipment and crawler crawling method and equipment

Publications (2)

Publication Number Publication Date
CN102355488A CN102355488A (en) 2012-02-15
CN102355488B true CN102355488B (en) 2014-01-22

Family

ID=45578982

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201110232595.XA Expired - Fee Related CN102355488B (en) 2011-08-15 2011-08-15 Crawler seed obtaining method and equipment and crawler crawling method and equipment

Country Status (1)

Country Link
CN (1) CN102355488B (en)

Families Citing this family (19)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103294732B (en) * 2012-03-05 2016-08-03 富士通株式会社 Webpage capture method and reptile
CN103778156A (en) * 2012-10-24 2014-05-07 阿里巴巴集团控股有限公司 Method and device for searching for data and server for data search
CN103077254B (en) * 2013-02-06 2017-11-03 人民日报媒体技术股份有限公司 Webpage acquisition methods and device
CN103617225B (en) * 2013-11-25 2019-03-08 北京奇虎科技有限公司 A kind of associating web pages searching method and system
CN106547778A (en) * 2015-09-21 2017-03-29 北京国双科技有限公司 The crawling method and device of webpage
CN105354337A (en) * 2015-12-08 2016-02-24 北京奇虎科技有限公司 Web crawler implementation method and web crawler system
CN107102997A (en) * 2016-02-22 2017-08-29 北京国双科技有限公司 data crawling method and device
CN107291727A (en) * 2016-03-31 2017-10-24 北京国双科技有限公司 The crawling method and device of a kind of reptile
CN106021438B (en) * 2016-05-16 2020-03-03 北京京东尚科信息技术有限公司 Method, device and system for preventing large-batch data from being captured
CN106021608A (en) * 2016-06-22 2016-10-12 广东亿迅科技有限公司 Distributed crawler system and implementing method thereof
CN106503211B (en) * 2016-11-03 2019-12-17 福州大学 Method for automatically generating mobile version facing information publishing website
CN107145553A (en) * 2017-04-28 2017-09-08 暴风集团股份有限公司 A kind of Network Data Capture method and system for competitive sports
CN107273409B (en) * 2017-05-03 2020-12-15 广州赫炎大数据科技有限公司 Network data acquisition, storage and processing method and system
CN107480297A (en) * 2017-08-30 2017-12-15 福建中金在线信息科技有限公司 A kind of article recording method and device
CN108182595A (en) * 2017-12-19 2018-06-19 山东浪潮云服务信息科技有限公司 A kind of formulation migration efficiency method and device
CN108924012A (en) * 2018-08-24 2018-11-30 赛尔网络有限公司 Method, equipment, system and the medium of IPv6 name server liveness detection
CN110008390A (en) * 2019-02-27 2019-07-12 深圳壹账通智能科技有限公司 Appraisal procedure, device, computer equipment and the storage medium of application program
CN111339388B (en) * 2019-06-13 2021-07-27 海通证券股份有限公司 Information crawling system
CN110719344B (en) * 2019-10-10 2022-02-15 北京知道创宇信息技术股份有限公司 Domain name acquisition method and device, electronic equipment and storage medium

Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101452463A (en) * 2007-12-05 2009-06-10 浙江大学 Method and apparatus for directionally grabbing page resource
CN101826110A (en) * 2010-04-13 2010-09-08 北京大学 Method for crawling BitTorrent torrent files

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US8484180B2 (en) * 2009-06-03 2013-07-09 Yahoo! Inc. Graph-based seed selection algorithm for web crawlers

Also Published As

Publication number Publication date
CN102355488A (en) 2012-02-15

Similar Documents

Publication Publication Date Title
CN102355488B (en) Crawler seed obtaining method and equipment and crawler crawling method and equipment
CN104125209B (en) Malice website prompt method and router
CN102930059B (en) Method for designing focused crawler
CN102054028B (en) Method for implementing web-rendering function by using web crawler system
CN102075570B (en) Method for implementing HTTP (hyper text transport protocol) message caching mechanism based on keywords
CN103118007B (en) A kind of acquisition methods of user access activity and system
CN102098229B (en) Method and device for optimizing and auditing uniform resource locator (URL) as well as network device
CN105243159A (en) Visual script editor-based distributed web crawler system
CN106776983B (en) Search engine optimization device and method
CN102624920A (en) Method and device for performing access through proxy server
CN103440139A (en) Acquisition method and tool facing microblog IDs (identitiesy) of mainstream microblog websites
CN103294732A (en) Web page crawling method and spider
CN103123630A (en) Method, system, mobile terminal and server for obtaining webpage contents
CN102004770A (en) Webpage auditing method and device
CN104182412A (en) Webpage crawling method and webpage crawling system
Shamrat et al. An effective implementation of web crawling technology to retrieve data from the world wide web (www)
CN102214172A (en) Caching method and caching equipment
CN102880679B (en) A kind of info web storage means and device
KR20180074774A (en) How to identify malicious websites, devices and computer storage media
CN110555146A (en) method and system for generating network crawler camouflage data
WO2016012868A1 (en) Method of and system for crawling a web resource
CN101727471A (en) Website content retrieval system and method
CN103440454B (en) A kind of active honeypot detection method based on search engine keywords
CN103905434A (en) Method and device for processing network data
CN106612336A (en) Picture preloading method and picture preloading device

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
C14 Grant of patent or utility model
GR01 Patent grant
CF01 Termination of patent right due to non-payment of annual fee

Granted publication date: 20140122

Termination date: 20210815