CN105550255A - Resource searching and scheduling method and device - Google Patents
- Publication number
- CN105550255A (application number CN201510901428.8A)
- Authority
- CN
- China
- Prior art keywords
- link
- page
- index page
- scheduling
- index
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Granted
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/90—Details of database functions independent of the retrieved data types
- G06F16/95—Retrieval from the web
- G06F16/951—Indexing; Web crawling techniques
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/90—Details of database functions independent of the retrieved data types
- G06F16/95—Retrieval from the web
- G06F16/958—Organisation or management of web site content, e.g. publishing, maintaining pages or automatic linking
Landscapes
- Engineering & Computer Science (AREA)
- Databases & Information Systems (AREA)
- Theoretical Computer Science (AREA)
- Data Mining & Analysis (AREA)
- Physics & Mathematics (AREA)
- General Engineering & Computer Science (AREA)
- General Physics & Mathematics (AREA)
- Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
Abstract
The invention provides a resource searching and scheduling method and device. The method comprises the following steps: obtaining the current body links of an index page to be scheduled; comparing the obtained current body links with the historical body links of the index page to be scheduled; if the comparison result indicates that a link omission exists, performing page-turning scheduling on the index page to be scheduled until no link omission exists, and then executing the subsequent scheduling operation; and if the comparison result indicates that no link omission exists, executing the subsequent scheduling operation. Link omissions during resource searching and scheduling can thereby be avoided, and the resource inclusion coverage rate is improved.
Description
Technical field
The present invention relates to the field of data search technology, and in particular to a resource searching and scheduling method and device.
Background
In network data search technology, the spider (Spider) system sits at the most upstream point of the search engine data flow. It is responsible for collecting resources on the internet to local storage for later retrieval, and is one of the main data sources of a search engine. The goal of the spider system is to discover and crawl all valuable web pages on the internet. To reach this goal, it must first discover the links of valuable web pages, so current spider systems use scheduling mechanisms to discover resource links as quickly and as completely as possible.
For example, when scheduling resource links, the following mechanisms can be set:
Mechanism one: the mined seeds are scheduled at a certain cycle (for example, 20 times per day), so that all time-sensitive web pages can be covered.
Mechanism two: considering limited traffic and the large number of index pages, general index pages (those not in the seed set) are scheduled at a certain cycle (for example, re-crawled once a week).
Each of the above scheduling mechanisms has at least the following shortcomings:
For mechanism one, when the scheduling interval is short, link omissions generally do not occur in the seeds, but traffic may be wasted: whenever a scheduled seed has not been updated, the traffic spent on it is wasted. When the scheduling interval is long, link omissions may occur in the seeds.
For mechanism two, because the scheduling interval is long, link omissions may occur.
Link omissions during scheduling reduce the inclusion coverage of the Spider system.
Summary of the invention
In view of the above problems, the present invention is proposed to provide a resource searching and scheduling method, and a corresponding resource searching and scheduling device, that overcome or at least partially solve the above problems, avoid link omissions during resource searching and scheduling, and improve resource inclusion coverage.
The invention provides a resource searching and scheduling method, comprising:
obtaining the current body links of an index page to be scheduled;
comparing the obtained current body links with the historical body links of the index page to be scheduled;
if it is determined from the comparison result that a link omission exists, performing page-turning scheduling on the index page to be scheduled until it is determined that no link omission exists, and then executing the subsequent scheduling operation;
if it is determined from the comparison result that no link omission exists, executing the subsequent scheduling operation.
In some optional embodiments, performing page-turning scheduling on the index page to be scheduled when a link omission is determined to exist, until it is determined that no link omission exists, specifically comprises:
when the current body links and the historical body links have no intersection, obtaining the current body links of the next page of the index page to be scheduled, and returning to the step of comparing the obtained current body links with the historical body links of the index page to be scheduled, until the current body links and the historical body links have an intersection.
In some optional embodiments, obtaining the current body links of the index page to be scheduled specifically comprises:
determining the largest similar block in the index page to be scheduled, and obtaining the current body links of the index page to be scheduled.
In some optional embodiments, determining the largest similar block in the index page to be scheduled specifically comprises:
obtaining the nodes whose Extensible Markup Language (XML) path (XPath) is identical, to obtain similar blocks;
determining the largest similar block in the index page to be scheduled according to the positions and areas of the similar blocks.
In some optional embodiments, the position of a similar block is determined according to the width, height, top margin and left margin of the similar block in the index page to be scheduled;
the area of a similar block is determined according to the width and height of the similar block in the index page to be scheduled.
In some optional embodiments, determining the largest similar block in the index page to be scheduled according to the positions and areas of the similar blocks specifically comprises:
determining the similar block that has the largest area and contains the page center point to be the largest similar block.
In some optional embodiments, before performing page-turning scheduling on the index page to be scheduled, the method further comprises:
obtaining the page-turning block in the index page to be scheduled, and determining the uniform resource locator (URL) of the next-page link by matching the anchor text (anchor) of links with regular expressions, so as to perform page-turning crawling according to the URL.
In some optional embodiments, obtaining the page-turning block in the index page to be scheduled specifically comprises:
determining the page-turning block in the index page to be scheduled from the information contained in the link anchors matched by regular expressions, and obtaining the determined page-turning block.
In some optional embodiments, if a link anchor matched by regular expressions contains page-turning block feature information, the page-turning block in the index page to be scheduled is determined based on the page-turning block feature information.
In some optional embodiments, determining the URL of the next-page link by matching link anchors with regular expressions specifically comprises:
when a link anchor matched by regular expressions contains next-page link information, determining the link corresponding to that anchor to be the URL of the next-page link.
In some optional embodiments, before comparing the obtained current body links with the historical body links of the index page to be scheduled, the method further comprises:
extracting a time sequence, in order, from the publication time information of each link included in the current body links;
judging, according to the publication time information of adjacent links, whether the extracted time sequence is in chronological or reverse chronological order;
if it is not, sorting the links in the current body links in chronological or reverse chronological order.
An embodiment of the present invention also provides a resource searching and scheduling device, comprising:
an acquisition module, configured to obtain the current body links of an index page to be scheduled;
a comparison module, configured to compare the obtained current body links with the historical body links of the index page to be scheduled;
an execution module, configured to: if it is determined from the comparison result that a link omission exists, perform page-turning scheduling on the index page to be scheduled until it is determined that no link omission exists, and then execute the subsequent scheduling operation; and if it is determined from the comparison result that no link omission exists, execute the subsequent scheduling operation.
In some optional embodiments, the execution module is specifically configured to:
when the current body links and the historical body links have no intersection, notify the acquisition module to obtain the current body links of the next page of the index page to be scheduled, whereupon the comparison module returns to the step of comparing the obtained current body links with the historical body links of the index page to be scheduled, until the current body links and the historical body links have an intersection, and then execute the subsequent scheduling operation.
In some optional embodiments, the acquisition module is specifically configured to:
determine the largest similar block in the index page to be scheduled, and obtain the current body links of the index page to be scheduled.
In some optional embodiments, the acquisition module is specifically configured to:
obtain the nodes whose Extensible Markup Language (XML) path (XPath) is identical, to obtain similar blocks;
determine the largest similar block in the index page to be scheduled according to the positions and areas of the similar blocks.
In some optional embodiments, the acquisition module is specifically configured to:
determine the position of a similar block according to its width, height, top margin and left margin in the index page to be scheduled, and determine the area of the similar block according to its width and height in the index page to be scheduled.
In some optional embodiments, the acquisition module is specifically configured to:
determine the similar block that has the largest area and contains the page center point to be the largest similar block.
In some optional embodiments, the acquisition module is further configured to:
obtain the page-turning block in the index page to be scheduled, and determine the uniform resource locator (URL) of the next-page link by matching the anchor text (anchor) of links with regular expressions, so as to perform page-turning crawling according to the URL.
In some optional embodiments, the acquisition module is specifically configured to:
determine the page-turning block in the index page to be scheduled from the information contained in the link anchors matched by regular expressions, and obtain the determined page-turning block.
In some optional embodiments, the acquisition module is specifically configured to:
if a link anchor matched by regular expressions contains page-turning block feature information, determine the page-turning block in the index page to be scheduled based on the page-turning block feature information.
In some optional embodiments, the acquisition module is specifically configured to:
when a link anchor matched by regular expressions contains next-page link information, determine the link corresponding to that anchor to be the URL of the next-page link.
In some optional embodiments, the comparison module is further configured to:
extract a time sequence, in order, from the publication time information of each link included in the current body links;
judge, according to the publication time information of adjacent links, whether the extracted time sequence is in chronological or reverse chronological order;
if it is not, sort the links in the current body links in chronological or reverse chronological order.
With the resource searching and scheduling method and device of the present invention, before the scheduling operation is carried out, the current body links of the index page to be scheduled are obtained and compared with the historical body links of that index page to determine whether a link omission exists. The subsequent scheduling operation is executed once it is determined that no link omission exists. This avoids link omissions during resource scheduling and improves the inclusion coverage of scheduled resources. The method does not rely on shortening the scheduling cycle, so it does not increase traffic overhead: link omissions are avoided while network traffic is saved.
Further, the above method obtains the current body links of the index page page by page and compares them page by page before scheduling, thereby avoiding link omissions as far as possible.
The above description is only an overview of the technical solution of the present invention. So that the technical means of the present invention can be understood more clearly and implemented according to the content of the specification, and so that the above and other objects, features and advantages of the present invention become more apparent, specific embodiments of the present invention are set forth below.
From the following detailed description of specific embodiments of the present invention, taken in conjunction with the accompanying drawings, those skilled in the art will better understand the above and other objects, advantages and features of the present invention.
Brief description of the drawings
Various other advantages and benefits will become clear to those of ordinary skill in the art upon reading the following detailed description of the preferred embodiments. The drawings are only for the purpose of illustrating the preferred embodiments and are not to be considered limiting of the present invention. Throughout the drawings, the same reference symbols denote the same parts. In the drawings:
Fig. 1 is a flowchart of the resource searching and scheduling method in Embodiment one of the present invention;
Fig. 2 is a screenshot of an index page in Embodiment one of the present invention;
Fig. 3 is a flowchart of the resource searching and scheduling method in Embodiment two of the present invention;
Fig. 4 is an example diagram of determining the largest similar block in the index page to be scheduled in Embodiment two of the present invention;
Fig. 5 is a flowchart of the resource searching and scheduling method in Embodiment three of the present invention;
Fig. 6 is a screenshot of another index page in Embodiment three of the present invention;
Fig. 7 is a flowchart of the resource searching and scheduling method in Embodiment four of the present invention;
Fig. 8 is a schematic structural diagram of the resource searching and scheduling device in an embodiment of the present invention.
Detailed description of the embodiments
Exemplary embodiments of the present disclosure are described in more detail below with reference to the accompanying drawings. Although the drawings show exemplary embodiments of the present disclosure, it should be understood that the disclosure can be implemented in various forms and should not be limited by the embodiments set forth here. Rather, these embodiments are provided so that the disclosure will be understood more thoroughly and so that its scope can be fully conveyed to those skilled in the art.
To solve the link omission problem in resource scheduling in the prior art, an embodiment of the present invention provides a resource searching and scheduling method that avoids link omissions as far as possible without increasing traffic overhead, and can improve the inclusion coverage of scheduled resources. It is described in detail below through specific embodiments.
Embodiment one:
The flow of the resource searching and scheduling method provided by Embodiment one of the present invention is shown in Fig. 1 and comprises the following steps:
Step S101: obtain the current body links of the index page to be scheduled.
Each time the index page is scheduled, the index page is parsed, and the body links found in it are extracted and recorded. Specifically, for the index page to be scheduled, the current body links are obtained by determining the largest similar block in the index page.
Here, an index page is a web page whose main body consists of links rather than content text. The body links are the set of links corresponding to the main body of an index page. For example, Fig. 2 is a screenshot of an index page: the body links of the index page http://roll.news.sina.com.cn/news/gnxw/gdxw1/index.shtml are shown in the large box in Fig. 2, which contains every body link in this index page.
Step S102: compare the obtained current body links with the historical body links of the index page to be scheduled.
After the current body links of the index page to be scheduled are obtained, the historical body links of that index page are obtained, and the two are compared to determine whether any link has been missed.
Optionally, the historical body link set of the index page series can be obtained in one pass. An index page together with its sequence of subsequent pages is called an index page series; the historical body link set of an index page series can comprise all body links extracted when the index page series was scheduled.
Step S103: determine from the comparison result whether a link omission exists. If so, perform step S104; if not, perform step S105.
From the result of comparing the current body links of the index page to be scheduled with its historical body links, it can be determined whether a link omission exists for that index page. For example, when the current body links and the historical body links have no link in common, links were missed between the two scheduling runs.
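The comparison in steps S102 and S103 can be sketched as a set-disjointness test. This is a minimal illustration, not the patented implementation: the function name `has_link_omission` and the representation of links as plain URL strings are assumptions made here. The source also does not say how a page with no recorded history is treated; this sketch simply reports an omission in that case and leaves the decision to the caller.

```python
def has_link_omission(current_links, history_links):
    """Return True when the current body links share no link with the history.

    Assumption: links are compared as exact URL strings. With an empty
    history the function returns True; handling of a first-ever run is
    left to the caller because the source does not describe it.
    """
    return set(current_links).isdisjoint(history_links)


if __name__ == "__main__":
    history = {"http://example.com/a", "http://example.com/b"}
    # No overlap with the history: links were probably missed in between.
    print(has_link_omission(["http://example.com/c"], history))
    # At least one known link is still on the page: nothing was missed.
    print(has_link_omission(["http://example.com/b", "http://example.com/c"], history))
```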
Step S104: perform page-turning scheduling on the index page to be scheduled until it is determined that no link omission exists, then perform step S105.
If it is determined from the comparison result that a link omission exists, page-turning scheduling is performed on the index page to be scheduled until it is determined that no link omission exists, and the subsequent scheduling operation is then executed.
When it is determined that a link omission exists for the index page to be scheduled, page-turning scheduling is performed: the next page is scheduled to find the missed links. If, after turning the page, comparing the current body links of the next page with the historical body links shows that a link omission still exists, page-turning scheduling continues until it is determined that no link omission exists.
Step S105: execute the subsequent scheduling operation.
When it is determined from the comparison result that no link omission exists for the index page to be scheduled, the subsequent scheduling operation is executed. Link omissions are thus avoided. Compared with the prior-art approach of reducing link omissions by shortening the scheduling cycle, this method avoids link omissions more effectively without increasing network traffic usage, which saves system traffic resources.
Embodiment two:
The flow of the resource searching and scheduling method provided by Embodiment two of the present invention is shown in Fig. 3 and comprises the following steps:
Step S201: obtain the current body links of the index page to be scheduled.
For the index page to be scheduled, the current body links are obtained by determining the largest similar block in the index page, where determining the largest similar block in the index page to be scheduled specifically comprises:
obtaining the nodes whose Extensible Markup Language (XML) path (XPath for short) is identical, to obtain similar blocks; and determining the largest similar block in the index page to be scheduled according to the positions and areas of the similar blocks.
Optionally, the position of a similar block is determined according to the width, height, top margin and left margin of the similar block in the index page to be scheduled; the area of a similar block is determined according to its width and height in the index page to be scheduled.
Optionally, determining the largest similar block in the index page to be scheduled according to the positions and areas of the similar blocks specifically comprises: determining the similar block that has the largest area and contains the page center point to be the largest similar block.
For example, a set formed by nodes with an identical XPath (where the number of nodes is greater than 4) is referred to here as a similar block. The position information of a similar block is computed, namely its width, height, top margin and left margin in the page; these four values determine the position of the similar block in the web page. The area of the similar block is computed from its width and height. The similar block that has the largest area and contains the center point is called the largest similar block.
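The selection rule above can be sketched as follows. This is a hedged illustration under several assumptions that are not in the source: each link node is already available as a hypothetical `Node` record carrying its XPath (with positional indexes removed so that sibling items share one path) and its rendered box; a block's box is taken as the bounding box of its nodes; the page center is taken as half the page width and height; and "largest area and contains the center point" is read as the largest block among those that contain the center.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Node:
    xpath: str     # hypothetical: XPath shared by all items of one list
    left: float    # left margin in the page
    top: float     # top margin in the page
    width: float
    height: float


def bounding_box(nodes):
    """Return (left, top, width, height) of the box enclosing all nodes."""
    left = min(n.left for n in nodes)
    top = min(n.top for n in nodes)
    right = max(n.left + n.width for n in nodes)
    bottom = max(n.top + n.height for n in nodes)
    return left, top, right - left, bottom - top


def largest_similar_block(nodes, page_width, page_height, min_nodes=5):
    """Return the XPath of the largest similar block, or None if none qualifies."""
    groups = defaultdict(list)
    for node in nodes:
        groups[node.xpath].append(node)

    center_x, center_y = page_width / 2, page_height / 2
    best_xpath, best_area = None, -1.0
    for xpath, members in groups.items():
        if len(members) < min_nodes:  # the text requires more than 4 nodes
            continue
        left, top, width, height = bounding_box(members)
        contains_center = (left <= center_x <= left + width
                           and top <= center_y <= top + height)
        area = width * height
        if contains_center and area > best_area:
            best_xpath, best_area = xpath, area
    return best_xpath
```

The links of the nodes in the returned group would then be taken as the current body links.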
Fig. 4 shows an example of determining the largest similar block in an index page to be scheduled. For the similar blocks of the index page http://roll.news.sina.com.cn/news/gnxw/gdxw1/index.shtml, the 3 thick-line boxes in Fig. 4 enclose 3 similar blocks, and the bottom box clearly encloses the largest similar block.
The links in the largest similar block are the body links; that is, the link corresponding to each news item in the bottom box is a body link, for example:
Next step after the Yangtze River shipwreck is righted: lifting the whole hull and draining it:
http://news.sina.com.cn/c/2015-06-05/133131917716.shtml;
Several Guangzhou hospitals pilot same-day discharge: surgery by day, home by evening:
http://news.sina.com.cn/c/2015-06-05/125531917706.shtml;
……
Step S202: compare the obtained current body links with the historical body links of the index page to be scheduled.
From the result of comparing the current body links of the index page to be scheduled with its historical body links, it is determined whether a link omission exists for that index page.
Step S203: judge whether the obtained current body links and the historical body links have an intersection. If so, perform step S205; if not, perform step S204.
Whether the current body links of the index page to be scheduled and the historical body links have an intersection determines whether the two share any link.
Step S204: obtain the current body links of the next page of the index page to be scheduled, and return to step S202.
When the obtained current body links and the historical body links have no intersection, the current body links of the next page of the index page to be scheduled are obtained, and the flow returns to the step of comparing the obtained current body links with the historical body links of the index page to be scheduled, until the current body links and the historical body links have an intersection.
As described in steps S202 to S204, each time the index page to be scheduled is scheduled, the body links currently found are compared with the body links found in the previous scheduling run. If there is no intersection, that is, the body links found in the previous run and the body links currently found have no link in common, links were missed between the two runs. The next page (page 2) then needs to be scheduled to find the missed links, and the body links currently found are added to the historical body link set of this index page series. When the next page is scheduled, the body links currently found are likewise compared with the historical body link set of this index page series. If there is no intersection, links are still being missed, and page-turning scheduling continues to the next page (page 3). This continues until the body links currently found and the historical body link set of this index page series have an intersection, which indicates that all missed links have been found and no further page-turning scheduling is needed.
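The page-turning loop of steps S202 to S204 can be sketched as below. It is a simplified illustration under stated assumptions: `fetch_body_links` is a hypothetical callback that returns the body links of page `page_no` (or nothing when the page does not exist), `series_history` is a mutable Python set holding the historical body link set of the index page series, and the `max_pages` cap is a safety limit added here that the source does not mention. Adding the links of the final, overlapping page to the history is also an assumption.

```python
def schedule_with_page_turning(fetch_body_links, series_history, max_pages=50):
    """Turn pages until the current body links intersect the series history.

    Returns the previously unseen links in the order they were found and
    updates series_history in place.
    """
    recovered = []
    for page_no in range(1, max_pages + 1):
        links = fetch_body_links(page_no)
        if not links:
            break  # no such page: nothing more to compare
        # Compare before updating the history, as in step S203.
        overlaps = not series_history.isdisjoint(links)
        for link in links:
            if link not in series_history:
                series_history.add(link)
                recovered.append(link)
        if overlaps:
            break  # intersection found: all missed links have been recovered
    return recovered


if __name__ == "__main__":
    # Hypothetical index page series, newest items on page 1.
    pages = {1: ["n9", "n8"], 2: ["n7", "n6"], 3: ["n5", "n4"], 4: ["n3", "n2"]}
    history = {"n5", "n4", "n3"}
    print(schedule_with_page_turning(pages.get, history))
```

In this example pages 1 and 2 have no intersection with the history, page 3 does, and page 4 is never fetched, so no traffic is spent beyond the first overlapping page.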
Step S205: execute the subsequent scheduling operation.
When it is determined from the comparison result that no link omission exists for the index page to be scheduled, the subsequent scheduling operation is executed.
Embodiment three:
The flow of the resource searching and scheduling method provided by Embodiment three of the present invention is shown in Fig. 5 and comprises the following steps:
Step S301: obtain the current body links of the index page to be scheduled.
Step S302: compare the obtained current body links with the historical body links of the index page to be scheduled.
Step S303: judge whether the obtained current body links and the historical body links have an intersection. If so, perform step S307; if not, perform step S304.
Step S304: obtain the page-turning block in the index page to be scheduled.
In this embodiment, on the basis of Embodiment one and Embodiment two, the page-turning block in the index page to be scheduled is obtained before the current body links of the next page are obtained, to facilitate page-turning scheduling.
Obtaining the page-turning block in the index page to be scheduled specifically comprises: determining the page-turning block in the index page to be scheduled from the information contained in the link anchors (anchor) matched by regular expressions, and obtaining the determined page-turning block.
If a link anchor matched by regular expressions contains page-turning block feature information, the page-turning block in the index page to be scheduled is determined based on the page-turning block feature information.
For example, the rectangle at the top of Fig. 2 is a page-turning block. The rectangle in the middle of Fig. 4 is also a page-turning block. The page-turning block shown in Fig. 4 can also be extracted from the similar blocks of the page http://roll.news.sina.com.cn/news/gnxw/gdxw1/index.shtml.
Optionally, whether a block is a page-turning block can be determined by matching link anchors with regular expressions. For example, a block can be determined to be a page-turning block when its anchors contain at least one of the following items of page-turning block feature information: a number, "<", ">", "<<", ">>", "previous page", "next page", "first page", "last page", and similar keywords.
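The anchor test above can be sketched with a single regular expression. This is an assumption-laden illustration: the pattern uses the English renderings of the keywords listed in the text (pages in other languages would need the corresponding strings), it requires the whole anchor text to equal a keyword so that headlines merely containing ">" or a digit do not match, and the function names are invented for this sketch.

```python
import re

# Feature information from the text: a number, "<", ">", "<<", ">>",
# "previous page", "next page", "first page", "last page".
PAGER_ANCHOR = re.compile(
    r"\s*(?:\d+|<<?|>>?|previous page|next page|first page|last page)\s*",
    re.IGNORECASE,
)


def is_pager_anchor(anchor_text):
    """True when the whole anchor text is one page-turning keyword."""
    return PAGER_ANCHOR.fullmatch(anchor_text) is not None


def is_page_turning_block(anchor_texts):
    """True when at least one anchor in the block carries pager feature information."""
    return any(is_pager_anchor(text) for text in anchor_texts)


if __name__ == "__main__":
    print(is_page_turning_block(["1", "2", "3", "Next page"]))
    print(is_page_turning_block(["Several hospitals pilot same-day discharge"]))
```

Requiring only one matching anchor follows the "at least one" wording of the text; a stricter variant could require most anchors in the block to match.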
Step S305: determine the uniform resource locator (Uniform Resource Locator, URL) of the next-page link by matching link anchors with regular expressions, so as to perform page-turning crawling according to the determined URL.
When a link anchor matched by regular expressions contains next-page link information, the link corresponding to that anchor is determined to be the URL of the next-page link, so as to perform page-turning crawling according to the determined URL.
When determining the next-page link, the anchors of the nodes in the page-turning block are matched with regular expressions to judge whether they match next-page link information, where the next-page link information comprises at least one of the following: a specified number, "next page", "following page", ">", and similar keywords. If an anchor matches next-page link information, the link corresponding to that anchor is recorded as the URL of the next-page link; otherwise, the next page number is computed from the current page number, and the URL of the next-page link is then assembled.
As shown in Fig. 4, a node anchor in the page-turning block matches the keyword "next page", so the next-page link is the link corresponding to the anchor "next page": http://roll.news.sina.com.cn/news/gnxw/gdxw1/index_2.shtml.
Fig. 6 is a partial screenshot of a web page that contains an example of a page-turning block, namely the rectangle at the bottom of Fig. 6. When the node anchors in this page-turning block are matched against next-page link information, none of "next page", "following page", ">" and so on match, so the next page number is computed (the current page is the index page, that is, page 1, so the next page number is 2). The link whose node anchor matches the number 2 is then found in the page-turning block; here it is http://gold.jrj.com.cn/list/hjzx-2.shtml.
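The two cases above (a keyword anchor as in Fig. 4, and the page-number fallback as in Fig. 6) can be sketched together. The function name, the `(anchor_text, url)` pair representation and the English keyword strings are assumptions made for illustration. The fallback here looks up the anchor equal to the next page number, as in the Fig. 6 example; assembling a URL from a pattern, which the text also mentions, is not covered.

```python
import re

# Next-page link information from the text, in English rendering (assumed).
NEXT_PAGE_ANCHOR = re.compile(r"\s*(?:next page|following page|>)\s*", re.IGNORECASE)


def next_page_url(pager_links, current_page_no):
    """Return the URL of the next page, or None if it cannot be determined.

    pager_links: list of (anchor_text, url) pairs from the page-turning block.
    """
    # Case 1: an anchor explicitly says "next page" or ">".
    for anchor_text, url in pager_links:
        if NEXT_PAGE_ANCHOR.fullmatch(anchor_text):
            return url
    # Case 2: fall back to the anchor whose text is the next page number.
    wanted = str(current_page_no + 1)
    for anchor_text, url in pager_links:
        if anchor_text.strip() == wanted:
            return url
    return None


if __name__ == "__main__":
    fig6_like = [
        ("2", "http://gold.jrj.com.cn/list/hjzx-2.shtml"),
        ("Last page", "http://example.com/last"),  # made-up entry for illustration
    ]
    print(next_page_url(fig6_like, current_page_no=1))
```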
Step S306: obtain the current body links of the next page of the index page to be scheduled, and return to step S302.
When the obtained current body links and the historical body links have no intersection, the current body links of the next page of the index page to be scheduled are obtained, and the flow returns to the step of comparing the obtained current body links with the historical body links of the index page to be scheduled, until the current body links and the historical body links have an intersection.
Step S307: execute the subsequent scheduling operation.
For the steps not elaborated above, see the description of the corresponding steps in Embodiment one and Embodiment two.
Embodiment four:
The flow of the resource searching and scheduling method provided by embodiment four of the present invention is shown in Fig. 7 and comprises the following steps:
Step S401: obtain the current body links of the index page to be scheduled.
Step S402: extract a time series in order, according to the publication time information of each link in the obtained current body links.
In this embodiment, before the obtained current body links are compared with the historical body links of the index page to be scheduled, it is determined whether the links in the obtained current body links follow chronological or reverse chronological order. If they do not, they are sorted, which simplifies the subsequent comparison of body links.
As shown in Fig. 4, in the large box at the bottom, the underlined items are the publication times of the body links. Each body link has corresponding publication time information, from which a time series can be extracted in order.
Step S403: according to the publication time information of adjacent links, judge whether the extracted time series is in chronological or reverse chronological order. If so, perform step S405; if not, perform step S404.
Judging from the publication times of adjacent links whether the extracted time series is in chronological or reverse chronological order determines whether the body links themselves are sorted in chronological or reverse chronological order.
Step S404: sort the links in the current body links in chronological or reverse chronological order.
When the extracted time series is in neither chronological nor reverse chronological order, the body links are not sorted by time. To simplify the comparison of body links, they can be sorted in chronological or reverse chronological order.
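Steps S402 to S404 can be sketched as below. This is an illustrative sketch under stated assumptions: body links are represented as `(url, publish_time)` pairs in page order, and the names `is_monotonic` and `ensure_time_ordered` as well as the `newest_first` parameter are hypothetical, not taken from the source.

```python
from datetime import datetime


def is_monotonic(times):
    """True if adjacent publication times never change direction,
    i.e. the series is chronological or reverse chronological."""
    pairs = list(zip(times, times[1:]))
    ascending = all(a <= b for a, b in pairs)
    descending = all(a >= b for a, b in pairs)
    return ascending or descending


def ensure_time_ordered(links, newest_first=True):
    """Return the body links in time order.

    links: list of (url, publish_time) pairs in the order they appear
           on the page (assumed representation).
    If the extracted time series is already ordered (S403), the links
    are returned unchanged; otherwise they are sorted (S404).
    """
    times = [publish_time for _, publish_time in links]  # S402
    if is_monotonic(times):
        return list(links)
    return sorted(links, key=lambda item: item[1], reverse=newest_first)
```

Checking adjacent pairs first avoids re-sorting pages that are already ordered, which is the common case for news-style index pages.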
Step S405: compare the obtained current body links with the historical body links of the index page to be scheduled.
Step S406: judge whether the obtained current body links and the historical body links intersect. If so, perform step S408; if not, perform step S407.
Step S407: obtain the current body links of the next page of the index page to be scheduled, and return to step S402.
When the obtained current body links and the historical body links have no intersection, the current body links of the next page of the index page to be scheduled are obtained, and the step of comparing the obtained current body links with the historical body links of the index page to be scheduled is performed again, until the current body links and the historical body links intersect.
For an index page whose body links are sorted in chronological or reverse chronological order, each time the index page is scheduled, the currently discovered body links are compared with the historical body link set. If there is no intersection, i.e. the body links found by the previous scheduling are completely different from those currently discovered, links have been missed between the two schedulings; page 2 of the pagination must be scheduled to find the missed links, and the currently discovered body links are added to the historical body link set of this index page series. When page 2 is scheduled, the currently discovered body links are likewise compared with the historical body link set of this index page series; if there is still no intersection, links are still being missed, and scheduling of page 3 is initiated. This continues until the currently discovered body links intersect the historical body link set of this index page series, which indicates that all missed links have been found and no further page turning scheduling is needed.
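The missed-link loop described above can be sketched as follows. This is a minimal sketch, not the patent's implementation: `fetch_body_links` is a hypothetical callback standing in for the crawler, the history is modelled as an in-memory `set`, and the `max_pages` safety bound is an added assumption that does not appear in the source.

```python
def schedule_index_page(fetch_body_links, history, max_pages=50):
    """Page through an index page series until no links are missed.

    fetch_body_links: callable(page_number) -> list of body link URLs,
                      returning an empty list when the page does not
                      exist (hypothetical interface).
    history:          set of body links found by earlier schedulings;
                      it is updated in place.
    max_pages:        assumed upper bound so that an empty history
                      cannot cause unbounded paging.
    Returns the newly discovered body links in discovery order.
    """
    discovered = []
    page = 1
    while page <= max_pages:
        current = fetch_body_links(page)
        if not current:
            break  # ran out of pages
        # Compare before the history is extended with this page.
        overlap = history.intersection(current)
        for url in current:
            if url not in history:
                history.add(url)
                discovered.append(url)
        if overlap:
            break  # intersection found: no link omission remains
        page += 1  # no intersection: links were missed, turn the page
    return discovered
```

Because the loop stops at the first page that overlaps the history, a scheduling round with no missed links costs a single page fetch; extra pages are fetched only when an omission is detected.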
For an index page whose body links are sorted in chronological order, the newest body links are generally of most interest, so scheduling can instead proceed backwards from the last page; the mechanism is the same.
Step S408: perform the subsequent scheduling operation.
For the steps not elaborated above, see the description of the corresponding steps in embodiment one, embodiment two and embodiment three.
The optional steps that embodiment three and embodiment four add relative to embodiment one and embodiment two may be added individually, taking only some of the steps described in embodiment three or embodiment four, or the steps added in embodiment three and in embodiment four may both be added on top of embodiment one and embodiment two at the same time. Such implementations are not described in detail as new embodiments.
Based on the same inventive concept, an embodiment of the present invention further provides a resource searching and scheduling device. As shown in Fig. 8, the device comprises an acquisition module 801, a comparison module 802 and an execution module 803.
Acquisition module 801, configured to obtain the current body links of the index page to be scheduled.
Comparison module 802, configured to compare the obtained current body links with the historical body links of the index page to be scheduled.
Execution module 803, configured to: if it is determined from the comparison result that a link omission exists, perform page turning scheduling on the index page to be scheduled until it is determined that no link omission exists, and then perform the subsequent scheduling operation; if it is determined from the comparison result that no link omission exists, perform the subsequent scheduling operation.
Optionally, the execution module 803 is specifically configured to: when the obtained current body links and the historical body links have no intersection, notify the acquisition module 801 to obtain the current body links of the next page of the index page to be scheduled, and have the comparison module 802 again perform the step of comparing the obtained current body links with the historical body links of the index page to be scheduled; and when the obtained current body links and the historical body links intersect, perform the subsequent scheduling operation.
Optionally, the acquisition module 801 is specifically configured to determine the largest similar block in the index page to be scheduled and obtain the current body links of the index page to be scheduled.
Optionally, the acquisition module 801 is specifically configured to obtain nodes whose Extensible Markup Language (XML) path (xpath) is identical, yielding similar blocks, and to determine the largest similar block in the index page to be scheduled according to the position and area of the similar blocks.
Optionally, the acquisition module 801 is specifically configured to determine the position of a similar block according to its width, height, top margin and left margin in the index page being scheduled, and to determine the area of the similar block according to its width and height in that page.
Optionally, the acquisition module 801 is specifically configured to determine that, among the similar blocks, the one that has the largest area and contains the page center point is the largest similar block.
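The selection of the largest similar block can be sketched as below. This is an illustration under several assumptions: the rendered geometry of each node is already available, xpaths are compared with positional indices stripped, and a similar block's position and area are taken from the bounding box of its nodes. The `Node` type and the function names are hypothetical.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass
class Node:
    xpath: str     # index-free path, e.g. "/html/body/div/ul/li/a"
    left: float    # left margin of the rendered node, in pixels
    top: float     # top margin of the rendered node, in pixels
    width: float
    height: float


def bounding_box(nodes):
    """Smallest rectangle (left, top, right, bottom) covering the nodes."""
    left = min(n.left for n in nodes)
    top = min(n.top for n in nodes)
    right = max(n.left + n.width for n in nodes)
    bottom = max(n.top + n.height for n in nodes)
    return left, top, right, bottom


def largest_similar_block(nodes, page_width, page_height):
    """Return the nodes of the similar block that has the largest area
    and contains the page center point, or None if no block does."""
    # Nodes sharing the same xpath form one similar block.
    groups = defaultdict(list)
    for node in nodes:
        groups[node.xpath].append(node)

    center_x, center_y = page_width / 2, page_height / 2
    best, best_area = None, 0.0
    for group in groups.values():
        left, top, right, bottom = bounding_box(group)
        area = (right - left) * (bottom - top)
        contains_center = left <= center_x <= right and top <= center_y <= bottom
        if contains_center and area > best_area:
            best, best_area = group, area
    return best
```

Requiring the block to contain the page center point filters out navigation bars and sidebars, which can be large but sit at the edges of the page.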
Optionally, the acquisition module 801 is further configured to obtain the page turning block in the index page to be scheduled and, by regex-matching the anchors of links, determine the uniform resource locator (URL) of the next-page link, so that page turning crawling is performed according to the URL.
Optionally, the acquisition module 801 is specifically configured to determine the page turning block in the index page to be scheduled from the information contained in the anchors of regex-matched links, and to obtain the determined page turning block.
Optionally, the acquisition module 801 is specifically configured to: if the anchor of a regex-matched link contains page turning block feature information, determine the page turning block in the index page to be scheduled based on that page turning block information.
Optionally, the acquisition module 801 is specifically configured to: when the anchor of a regex-matched link contains next-page link information, determine the link corresponding to that anchor as the URL of the next-page link.
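One way to recognise a page turning block from anchor text can be sketched as follows. The source does not enumerate the feature information, so everything specific here is an assumption: the feature patterns in `PAGING_FEATURE_RE`, the `min_ratio` threshold, and the representation of candidate blocks as lists of `(anchor_text, url)` pairs.

```python
import re

# Assumed page turning features: page numbers and common paging labels.
PAGING_FEATURE_RE = re.compile(
    r"^\s*(\d+|next|previous|prev|first|last|>|<|>>|<<)\s*$", re.IGNORECASE
)


def find_paging_block(blocks, min_ratio=0.6):
    """Pick the candidate block that looks most like a page turning block.

    blocks:    list of candidate blocks, each a list of (anchor_text, url)
               pairs (hypothetical representation).
    min_ratio: assumed minimum share of anchors that must match a paging
               feature for a block to qualify.
    Returns the best matching block, or None if none qualifies.
    """
    best, best_ratio = None, 0.0
    for block in blocks:
        if not block:
            continue
        hits = sum(1 for anchor, _ in block if PAGING_FEATURE_RE.match(anchor))
        ratio = hits / len(block)
        if ratio >= min_ratio and ratio > best_ratio:
            best, best_ratio = block, ratio
    return best
```

Scoring by the share of matching anchors, rather than by a single match, keeps an article list that happens to contain one numeric title from being mistaken for the page turning block.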
Optionally, the comparison module 802 is further configured to: extract a time series in order according to the publication time information of each link in the current body links; judge, according to the publication time information of adjacent links, whether the extracted time series is in chronological or reverse chronological order; and if not, sort the links in the obtained current body links in chronological or reverse chronological order.
The resource searching and scheduling method and device provided by the embodiments of the present invention are a scheduling mechanism for specific index pages. Scheduling is driven by missed-link detection, which improves the inclusion coverage rate of resource searching at the lowest traffic cost.
Numerous specific details are described in the specification provided herein. It will be understood, however, that embodiments of the invention can be practiced without these specific details. In some instances, well-known methods, structures and techniques are not shown in detail so as not to obscure the understanding of this description.
Similarly, it should be understood that, in order to streamline the disclosure and aid the understanding of one or more of the various inventive aspects, in the above description of exemplary embodiments of the invention the features of the invention are sometimes grouped together in a single embodiment, figure, or description thereof. However, this method of disclosure is not to be interpreted as reflecting an intention that the claimed invention requires more features than are expressly recited in each claim. Rather, as the following claims reflect, inventive aspects lie in fewer than all features of a single foregoing disclosed embodiment. The claims following the detailed description are therefore expressly incorporated into that detailed description, with each claim standing on its own as a separate embodiment of the invention.
Those skilled in the art will appreciate that the modules in the device of an embodiment can be adaptively changed and arranged in one or more devices different from that embodiment. The modules or units or components of an embodiment can be combined into one module or unit or component, and can additionally be divided into multiple sub-modules or sub-units or sub-components. Except where at least some of such features and/or processes or units are mutually exclusive, all features disclosed in this specification (including the accompanying claims, abstract and drawings) and all processes or units of any method or device so disclosed may be combined in any combination. Unless expressly stated otherwise, each feature disclosed in this specification (including the accompanying claims, abstract and drawings) may be replaced by an alternative feature serving the same, equivalent or similar purpose.
In addition, those skilled in the art will understand that although some embodiments described herein include certain features that are included in other embodiments and not others, combinations of features of different embodiments are meant to be within the scope of the invention and to form different embodiments. For example, in the claims, any of the claimed embodiments can be used in any combination.
The component embodiments of the invention may be implemented in hardware, in software modules running on one or more processors, or in a combination of the two. Those skilled in the art should understand that a microprocessor or digital signal processor (DSP) may be used in practice to implement some or all of the functions of some or all of the components of the resource searching and scheduling device according to embodiments of the invention. The invention may also be implemented as a device or apparatus program (for example, a computer program and a computer program product) for performing part or all of the method described herein. Such a program implementing the invention may be stored on a computer-readable medium, or may take the form of one or more signals. Such a signal may be downloaded from an Internet website, provided on a carrier signal, or provided in any other form.
It should be noted that the above embodiments illustrate rather than limit the invention, and that those skilled in the art can design alternative embodiments without departing from the scope of the appended claims. In the claims, any reference sign placed between parentheses shall not be construed as limiting the claim. The word "comprising" does not exclude the presence of elements or steps not listed in a claim. The word "a" or "an" preceding an element does not exclude the presence of a plurality of such elements. The invention may be implemented by means of hardware comprising several distinct elements and by means of a suitably programmed computer. In a unit claim enumerating several means, several of these means may be embodied by one and the same item of hardware. The use of the words first, second, third and so on does not indicate any order; these words may be interpreted as names.
Thus far, those skilled in the art will recognize that, although multiple exemplary embodiments of the invention have been shown and described in detail herein, many other variations or modifications consistent with the principles of the invention can still be directly determined or derived from the disclosure without departing from the spirit and scope of the invention. The scope of the invention should therefore be understood and deemed to cover all such other variations or modifications.
According to one aspect of the present invention, there is also disclosed A1. A resource searching and scheduling method, comprising:
obtaining the current body links of an index page to be scheduled;
comparing the obtained current body links with the historical body links of the index page to be scheduled;
if it is determined from the comparison result that a link omission exists, performing page turning scheduling on the index page to be scheduled until it is determined that no link omission exists, and then performing a subsequent scheduling operation;
if it is determined from the comparison result that no link omission exists, performing the subsequent scheduling operation.
A2. The method according to A1, wherein, if it is determined from the comparison result that a link omission exists, performing page turning scheduling on the index page to be scheduled until it is determined that no link omission exists specifically comprises:
when the current body links and the historical body links have no intersection, obtaining the current body links of the next page of the index page to be scheduled, and returning to the step of comparing the obtained current body links with the historical body links of the index page to be scheduled, until the current body links and the historical body links intersect.
A3. The method according to A1, wherein obtaining the current body links of the index page to be scheduled specifically comprises:
determining the largest similar block in the index page to be scheduled, and obtaining the current body links of the index page to be scheduled.
A4. The method according to A3, wherein determining the largest similar block in the index page to be scheduled specifically comprises:
obtaining nodes whose Extensible Markup Language (XML) path (xpath) is identical, yielding similar blocks;
determining the largest similar block in the index page to be scheduled according to the position and area of the similar blocks.
A5. The method according to A4, wherein the position of a similar block is determined according to the width, height, top margin and left margin of the similar block in the index page being scheduled;
the area of the similar block is determined according to the width and height of the similar block in the index page being scheduled.
A6. The method according to A4, wherein determining the largest similar block in the index page to be scheduled according to the position and area of the similar blocks specifically comprises:
determining that, among the similar blocks, the one that has the largest area and contains the page center point is the largest similar block.
A7. The method according to A1, wherein, before page turning scheduling is performed on the index page to be scheduled, the method further comprises:
obtaining the page turning block in the index page to be scheduled and, by regex-matching the anchors of links, determining the uniform resource locator (URL) of the next-page link, so that page turning crawling is performed according to the URL.
A8. The method according to A7, wherein obtaining the page turning block in the index page to be scheduled specifically comprises:
determining the page turning block in the index page to be scheduled from the information contained in the anchors of regex-matched links, and obtaining the determined page turning block.
A9. The method according to A8, wherein, if the anchor of a regex-matched link contains page turning block feature information, the page turning block in the index page to be scheduled is determined based on that page turning block information.
A10. The method according to A7, wherein determining the uniform resource locator (URL) of the next-page link by regex-matching the anchors of links specifically comprises:
when the anchor of a regex-matched link contains next-page link information, determining the link corresponding to that anchor as the URL of the next-page link.
A11. The method according to any one of A1 to A10, wherein, before the obtained current body links are compared with the historical body links of the index page to be scheduled, the method further comprises:
extracting a time series in order according to the publication time information of each link in the current body links;
judging, according to the publication time information of adjacent links, whether the extracted time series is in chronological or reverse chronological order;
if not, sorting the links in the current body links in chronological or reverse chronological order.
According to another aspect of the present invention, there is also disclosed B12. A resource searching and scheduling device, comprising:
an acquisition module, configured to obtain the current body links of an index page to be scheduled;
a comparison module, configured to compare the obtained current body links with the historical body links of the index page to be scheduled;
an execution module, configured to: if it is determined from the comparison result that a link omission exists, perform page turning scheduling on the index page to be scheduled until it is determined that no link omission exists, and then perform a subsequent scheduling operation; if it is determined from the comparison result that no link omission exists, perform the subsequent scheduling operation.
B13. The device according to B12, wherein the execution module is specifically configured to:
when the current body links and the historical body links have no intersection, notify the acquisition module to obtain the current body links of the next page of the index page to be scheduled, and have the comparison module again perform the step of comparing the obtained current body links with the historical body links of the index page to be scheduled; and when the current body links and the historical body links intersect, perform the subsequent scheduling operation.
B14. The device according to B12, wherein the acquisition module is specifically configured to:
determine the largest similar block in the index page to be scheduled, and obtain the current body links of the index page to be scheduled.
B15. The device according to B14, wherein the acquisition module is specifically configured to:
obtain nodes whose Extensible Markup Language (XML) path (xpath) is identical, yielding similar blocks;
determine the largest similar block in the index page to be scheduled according to the position and area of the similar blocks.
B16. The device according to B15, wherein the acquisition module is specifically configured to:
determine the position of a similar block according to its width, height, top margin and left margin in the index page being scheduled, and determine the area of the similar block according to its width and height in that page.
B17. The device according to B15, wherein the acquisition module is specifically configured to:
determine that, among the similar blocks, the one that has the largest area and contains the page center point is the largest similar block.
B18. The device according to B12, wherein the acquisition module is further configured to:
obtain the page turning block in the index page to be scheduled and, by regex-matching the anchors of links, determine the uniform resource locator (URL) of the next-page link, so that page turning crawling is performed according to the URL.
B19. The device according to B18, wherein the acquisition module is specifically configured to:
determine the page turning block in the index page to be scheduled from the information contained in the anchors of regex-matched links, and obtain the determined page turning block.
B20. The device according to B19, wherein the acquisition module is specifically configured to:
if the anchor of a regex-matched link contains page turning block feature information, determine the page turning block in the index page to be scheduled based on that page turning block information.
B21. The device according to B18, wherein the acquisition module is specifically configured to:
when the anchor of a regex-matched link contains next-page link information, determine the link corresponding to that anchor as the URL of the next-page link.
B22. The device according to any one of B12 to B21, wherein the comparison module is further configured to:
extract a time series in order according to the publication time information of each link in the current body links;
judge, according to the publication time information of adjacent links, whether the extracted time series is in chronological or reverse chronological order;
if not, sort the links in the current body links in chronological or reverse chronological order.
Claims (10)
1. A resource searching and scheduling method, comprising:
obtaining the current body links of an index page to be scheduled;
comparing the obtained current body links with the historical body links of the index page to be scheduled;
if it is determined from the comparison result that a link omission exists, performing page turning scheduling on the index page to be scheduled until it is determined that no link omission exists, and then performing a subsequent scheduling operation;
if it is determined from the comparison result that no link omission exists, performing the subsequent scheduling operation.
2. The method according to claim 1, wherein, if it is determined from the comparison result that a link omission exists, performing page turning scheduling on the index page to be scheduled until it is determined that no link omission exists specifically comprises:
when the current body links and the historical body links have no intersection, obtaining the current body links of the next page of the index page to be scheduled, and returning to the step of comparing the obtained current body links with the historical body links of the index page to be scheduled, until the current body links and the historical body links intersect.
3. The method according to claim 1 or 2, wherein obtaining the current body links of the index page to be scheduled specifically comprises:
determining the largest similar block in the index page to be scheduled, and obtaining the current body links of the index page to be scheduled.
4. The method according to any one of claims 1 to 3, wherein determining the largest similar block in the index page to be scheduled specifically comprises:
obtaining nodes whose Extensible Markup Language (XML) path (xpath) is identical, yielding similar blocks;
determining the largest similar block in the index page to be scheduled according to the position and area of the similar blocks.
5. The method according to any one of claims 1 to 4, wherein the position of a similar block is determined according to the width, height, top margin and left margin of the similar block in the index page being scheduled;
the area of the similar block is determined according to the width and height of the similar block in the index page being scheduled.
6. The method according to any one of claims 1 to 5, wherein determining the largest similar block in the index page to be scheduled according to the position and area of the similar blocks specifically comprises:
determining that, among the similar blocks, the one that has the largest area and contains the page center point is the largest similar block.
7. The method according to any one of claims 1 to 6, wherein, before page turning scheduling is performed on the index page to be scheduled, the method further comprises:
obtaining the page turning block in the index page to be scheduled and, by regex-matching the anchors of links, determining the uniform resource locator (URL) of the next-page link, so that page turning crawling is performed according to the URL.
8. The method according to any one of claims 1 to 7, wherein obtaining the page turning block in the index page to be scheduled specifically comprises:
determining the page turning block in the index page to be scheduled from the information contained in the anchors of regex-matched links, and obtaining the determined page turning block.
9. The method according to any one of claims 1 to 8, wherein, if the anchor of a regex-matched link contains page turning block feature information, the page turning block in the index page to be scheduled is determined based on that page turning block information.
10. A resource searching and scheduling device, comprising:
an acquisition module, configured to obtain the current body links of an index page to be scheduled;
a comparison module, configured to compare the obtained current body links with the historical body links of the index page to be scheduled;
an execution module, configured to: if it is determined from the comparison result that a link omission exists, perform page turning scheduling on the index page to be scheduled until it is determined that no link omission exists, and then perform a subsequent scheduling operation; if it is determined from the comparison result that no link omission exists, perform the subsequent scheduling operation.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201510901428.8A CN105550255B (en) | 2015-12-08 | 2015-12-08 | Resource searching and scheduling method and device |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201510901428.8A CN105550255B (en) | 2015-12-08 | 2015-12-08 | Resource searching and scheduling method and device |
Publications (2)
Publication Number | Publication Date |
---|---|
CN105550255A (en) | 2016-05-04 |
CN105550255B (en) | 2019-04-05 |
Family
ID=55829444
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201510901428.8A Active CN105550255B (en) | Resource searching and scheduling method and device | 2015-12-08 | 2015-12-08 |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN105550255B (en) |
Citations (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
WO2004102426A1 (en) * | 2003-05-16 | 2004-11-25 | Nhn Corporation | A method of providing website searching service and a system thereof |
CN101179472A (en) * | 2007-05-31 | 2008-05-14 | 腾讯科技(深圳)有限公司 | Network resource searching method and searching system |
US7844593B2 (en) * | 2005-09-23 | 2010-11-30 | Tecent Technology (Shenzhen) Company Limited | Method and system for network search |
-
2015
- 2015-12-08 CN CN201510901428.8A patent/CN105550255B/en active Active
Non-Patent Citations (1)
Title |
---|
杜言琦: ""面向论坛页面的增量搜集技术研究"", 《中国优秀硕士学位论文全文数据库 信息科技辑》 * |
Also Published As
Publication number | Publication date |
---|---|
CN105550255B (en) | 2019-04-05 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN103605764A (en) | Web crawler system and web crawler multitask executing and scheduling method | |
CN102043862B (en) | Directional web data extraction method | |
CN105243159A (en) | Visual script editor-based distributed web crawler system | |
CN102982162B (en) | The acquisition system of info web | |
CN104915260A (en) | Hadoop cluster management task distributing method and system | |
CN102982161A (en) | Method and device for acquiring webpage information | |
CN105550268A (en) | Big data process modeling analysis engine | |
CA2790479C (en) | Partitioning a search space for distributed crawling | |
KR20170139556A (en) | System and method for querying data sources | |
Herrán et al. | A parallel variable neighborhood search approach for the obnoxious p‐median problem | |
JP2014525640A (en) | Expansion of parallel processing development environment | |
WO2014117295A1 (en) | Performing an index operation in a mapreduce environment | |
CN101739398A (en) | Distributed database multi-join query optimization algorithm | |
CN109460235A (en) | Configuration software compiling method | |
CN106202108A (en) | Web crawlers captures method for allocating tasks and device and data grab method and device | |
CN109376142A (en) | Data migration method and terminal device | |
CN104462158A (en) | Data grabbing method and data grabbing system | |
CN111444412B (en) | Method and device for scheduling web crawler tasks | |
KR101287371B1 (en) | Method and Device for Collecting Web Contents and Computer-readable Recording Medium for the same | |
CN102760073A (en) | Method, system and device for scheduling task | |
CN104050297A (en) | Inquiry transaction distribution method and device | |
Seshadri et al. | Optimizing multiple queries in distributed data stream systems | |
CN102087665B (en) | Automatic service combination method for supporting continuous query and system thereof | |
CN101393563A (en) | Web data processing method based on form concept analysis | |
CN110457555A (en) | Collecting method, device and computer equipment, storage medium based on Docker |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
C06 | Publication | ||
PB01 | Publication | ||
C10 | Entry into substantive examination | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||
GR01 | Patent grant | ||
TR01 | Transfer of patent right | ||
TR01 | Transfer of patent right |
Effective date of registration: 20220808 Address after: Room 801, 8th floor, No. 104, floors 1-19, building 2, yard 6, Jiuxianqiao Road, Chaoyang District, Beijing 100015 Patentee after: BEIJING QIHOO TECHNOLOGY Co.,Ltd. Address before: 100088 room 112, block D, 28 new street, new street, Xicheng District, Beijing (Desheng Park) Patentee before: BEIJING QIHOO TECHNOLOGY Co.,Ltd. Patentee before: Qizhi software (Beijing) Co.,Ltd. |