CN106649362A - Webpage crawling method and apparatus - Google Patents

Webpage crawling method and apparatus

Info

Publication number
CN106649362A
Authority
CN
China
Prior art keywords
keyword
keyword group
task queue
server
crawled
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN201510729544.6A
Other languages
Chinese (zh)
Other versions
CN106649362B (en)
Inventor
何熠皓
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Gridsum Technology Co Ltd
Original Assignee
Beijing Gridsum Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Gridsum Technology Co Ltd
Priority to CN201510729544.6A
Publication of CN106649362A
Application granted
Publication of CN106649362B
Legal status: Active
Anticipated expiration

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90 Details of database functions independent of the retrieved data types
    • G06F16/95 Retrieval from the web
    • G06F16/951 Indexing; Web crawling techniques

Landscapes

  • Engineering & Computer Science (AREA)
  • Databases & Information Systems (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention discloses a webpage crawling method and apparatus. The method comprises the steps that a plurality of servers obtain keyword sets from a task queue, wherein the task queue stores a plurality of to-be-crawled keyword sets, and each to-be-crawled keyword set contains a plurality of keywords; and the servers crawl search engine result pages corresponding to all the keywords in the obtained keyword sets through respective network crawlers. According to the method and the apparatus, the technical problem of relatively low efficiency of crawling keyword search engine result pages through a network crawler of a single server in related technologies is solved.

Description

Webpage crawling method and apparatus
Technical field
The present application relates to the Internet field, and in particular to a webpage crawling method and apparatus.
Background
In traditional search engine optimization (SEO) business, it is usually necessary to help customers analyze how their keywords rank in search engines. Typically, a user presets a group of keywords and periodically uses a web crawler to crawl the ranking of these keywords in a search engine, that is, to crawl the search engine result page corresponding to each keyword. The search engine result page corresponding to a keyword is the page of search results displayed after the keyword is entered into a search engine (for example, Baidu or Sogou).
However, to block access by robots (for example, web crawlers) or to reduce abnormal traffic, search engines often limit the search rate or the number of searches allowed from a single IP address (an anti-crawler policy). The keywords specified by users usually add up to a very large number. Crawling keyword search engine result pages from a single machine or a single IP address is therefore not only inefficient; the search engine's anti-crawler policy can also easily make it impossible to crawl the result pages for all keywords.
No effective solution has yet been proposed for the low efficiency of crawling keyword search engine result pages through the web crawler of a single server in the related art.
Summary of the invention
The main purpose of the present application is to provide a webpage crawling method and apparatus, so as to solve the problem in the related art that crawling keyword search engine result pages through the web crawler of a single server is inefficient.
To achieve the above purpose, according to one aspect of the present application, a webpage crawling method is provided. The method includes: a plurality of servers respectively obtaining keyword groups from a task queue, wherein the task queue stores a plurality of keyword groups to be crawled, and each keyword group to be crawled includes a plurality of keywords; and the plurality of servers respectively crawling, through their respective web crawlers, the search engine result page corresponding to each keyword in the obtained keyword groups.
Further, the plurality of servers include a first server, and the plurality of servers respectively obtaining keyword groups from the task queue includes the first server obtaining a keyword group from the task queue, which includes: the first server detecting whether a keyword group to be crawled exists in the task queue; the first server locking the task queue upon detecting that a keyword group to be crawled exists in the task queue, wherein the locked task queue can be read only by the first server; and the first server obtaining the keyword group from the locked task queue and releasing the task queue, wherein the released task queue can be read by any one of the plurality of servers.
Further, the plurality of servers include a first server, the keyword group obtained by the first server is a first keyword group, and the web crawler of the first server is a first web crawler. The plurality of servers respectively crawling the search engine result pages through their respective web crawlers includes the first server crawling, through the first web crawler, the search engine result page corresponding to each keyword in the first keyword group, which includes: traversing the first keyword group and crawling, through the first web crawler, the search engine result page corresponding to each keyword in the first keyword group; determining whether the first web crawler succeeded in crawling the search engine result page corresponding to each keyword in the first keyword group; and, upon determining that crawling failed for the search engine result page of a keyword in the first keyword group, adding the keyword for which crawling failed to a failure list.
Further, after the keyword for which crawling failed in the first keyword group is added to the failure list, the method also includes: packaging the keywords in the failure list into a new keyword group; and adding the new keyword group to the task queue.
Further, packaging the keywords in the failure list into a new keyword group includes: obtaining the retry count of the keywords in the failure list; determining whether the retry count of the keywords in the failure list is less than a preset value; and, upon determining that the retry count of the keywords in the failure list is less than the preset value, packaging the keywords in the failure list into the new keyword group.
Further, before the plurality of servers respectively obtain keyword groups from the task queue, the method also includes: grouping a plurality of keywords according to a preset rule to obtain a plurality of keyword groups; and storing the plurality of keyword groups in the task queue according to priority.
To achieve the above purpose, according to another aspect of the present application, a webpage crawling apparatus is provided. The apparatus includes: an acquiring unit, configured to make a plurality of servers respectively obtain keyword groups from a task queue, wherein the task queue stores a plurality of keyword groups to be crawled, and each keyword group to be crawled includes a plurality of keywords; and a crawling unit, configured to make the plurality of servers respectively crawl, through their respective web crawlers, the search engine result page corresponding to each keyword in the obtained keyword groups.
Further, the plurality of servers include a first server, and the acquiring unit includes: a detection module, configured to make the first server detect whether a keyword group to be crawled exists in the task queue; a locking module, configured to make the first server lock the task queue upon detecting that a keyword group to be crawled exists in the task queue, wherein the locked task queue can be read only by the first server; and an acquisition module, configured to make the first server obtain the keyword group from the locked task queue and release the task queue, wherein the released task queue can be read by any one of the plurality of servers.
Further, the plurality of servers include a first server, the keyword group obtained by the first server is a first keyword group, the web crawler of the first server is a first web crawler, and the crawling unit includes: a crawling module, configured to traverse the first keyword group and crawl, through the first web crawler, the search engine result page corresponding to each keyword in the first keyword group; a judging module, configured to determine whether the first web crawler succeeded in crawling the search engine result page corresponding to each keyword in the first keyword group; and an adding module, configured to add, upon determining that crawling failed for the search engine result page of a keyword in the first keyword group, the keyword for which crawling failed to a failure list.
Further, the apparatus also includes: a packaging unit, configured to package the keywords in the failure list into a new keyword group; and an adding unit, configured to add the new keyword group to the task queue.
In the present application, a plurality of servers respectively obtain keyword groups from a task queue, wherein the task queue stores a plurality of keyword groups to be crawled and each keyword group to be crawled includes a plurality of keywords, and the plurality of servers respectively crawl, through their respective web crawlers, the search engine result page corresponding to each keyword in the obtained keyword groups. Because the search engine result pages corresponding to the keywords are crawled in a distributed manner by multiple servers, crawling efficiency can be improved and the likelihood of triggering the search engine's anti-crawler policy can be reduced. This solves the problem in the related art that crawling keyword search engine result pages through the web crawler of a single server is inefficient, and thereby achieves the effect of improving the efficiency of crawling the search engine result pages corresponding to keywords.
Brief description of the drawings
The accompanying drawings, which form a part of the present application, are provided for further understanding of the present application. The illustrative embodiments of the present application and their descriptions are used to explain the present application and do not constitute an improper limitation on it. In the drawings:
Fig. 1 is a flowchart of the webpage crawling method according to an embodiment of the present application;
Fig. 2 is a schematic diagram of distributed webpage crawling according to an embodiment of the present application; and
Fig. 3 is a schematic diagram of the webpage crawling apparatus according to an embodiment of the present application.
Detailed description
It should be noted that, where no conflict arises, the embodiments of the present application and the features in the embodiments may be combined with one another. The present application is described in detail below with reference to the accompanying drawings and in conjunction with the embodiments.
To enable those skilled in the art to better understand the solution of the present application, the technical solutions in the embodiments of the present application are described clearly and completely below with reference to the accompanying drawings of the embodiments. Evidently, the described embodiments are only some of the embodiments of the present application, not all of them. All other embodiments obtained by a person of ordinary skill in the art based on the embodiments of the present application without creative effort shall fall within the scope of protection of the present application.
It should be noted that the terms "first", "second" and the like in the description, the claims and the above drawings of the present application are used to distinguish similar objects and are not necessarily used to describe a specific order or sequence. It should be understood that data used in this way are interchangeable where appropriate, so that the embodiments of the present application described herein can be implemented. In addition, the terms "comprising" and "having" and any variations of them are intended to cover non-exclusive inclusion. For example, a process, method, system, product or device that comprises a series of steps or units is not necessarily limited to the steps or units expressly listed, but may include other steps or units that are not expressly listed or that are inherent to such a process, method, product or device.
For ease of description, some concepts involved in the present application are explained below:
Search engine optimization (SEO): a way of using the search rules of a search engine to improve the ranking of a target website in the relevant search engine.
Queue: a special kind of linear list that allows deletion only at the front of the list and insertion only at the rear. It is also called a first-in, first-out (First In First Out) linear list, or FIFO list for short. The queue in the embodiments of the present application may take the form of a distributed queue component or of a database.
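The FIFO behaviour described above can be illustrated with a minimal in-memory sketch. It assumes Python's standard `collections.deque` as a stand-in for the distributed queue component or database mentioned in the text; the keyword strings are purely illustrative.

```python
from collections import deque

# In-memory stand-in for the task queue (an assumption; the text allows a
# distributed queue component or a database instead).
task_queue = deque()

# Insertion happens at the rear of the queue.
task_queue.append(["keyword a", "keyword b"])
task_queue.append(["keyword c"])

# Deletion happens at the front, so the first group stored is the first one read.
first_group = task_queue.popleft()
```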
Web crawler: also known as a web spider or web robot, a program or script that automatically fetches information from the World Wide Web according to preset rules.
According to an embodiment of the present application, a webpage crawling method is provided. Fig. 1 is a flowchart of the webpage crawling method according to an embodiment of the present application. As shown in Fig. 1, the method includes steps S102 to S104:
Step S102: a plurality of servers respectively obtain keyword groups from a task queue, wherein the task queue stores a plurality of keyword groups to be crawled, and each keyword group to be crawled includes a plurality of keywords.
Specifically, a scheduler may group the keywords preset by the user and place the resulting keyword groups in the task queue. Preferably, to ensure that the search engine result pages of the more important keywords are crawled first, before the plurality of servers respectively obtain keyword groups from the task queue, the method also includes: grouping a plurality of keywords according to a preset rule to obtain a plurality of keyword groups; and storing the plurality of keyword groups in the task queue according to priority.
Specifically, the embodiment of the present application may sort the keywords by importance (for example, keywords that need to be processed first have high importance) and, following that order, take a preset number of keywords at a time to form a group. For example, given 300 keywords in total, the 300 keywords are sorted by importance and, following that order, 50 keywords at a time are taken to form a keyword group, which yields 6 keyword groups. These 6 keyword groups are added to the task queue according to priority, wherein a keyword group of high importance has high priority and a keyword group of low importance has low priority.
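The grouping in the example above (300 keywords, 50 per group, 6 groups) can be sketched as follows. This is only an illustration: the function name `group_keywords`, the numeric importance scores and the use of an in-memory `deque` are assumptions, not details given in the application.

```python
from collections import deque

def group_keywords(keywords, importance, group_size):
    """Sort keywords by descending importance and cut them into fixed-size groups."""
    ranked = sorted(keywords, key=lambda kw: importance[kw], reverse=True)
    return [ranked[i:i + group_size] for i in range(0, len(ranked), group_size)]

# Hypothetical input: 300 keywords with made-up importance scores.
keywords = [f"kw{i}" for i in range(300)]
importance = {kw: 300 - i for i, kw in enumerate(keywords)}

groups = group_keywords(keywords, importance, group_size=50)

# Storing the groups in ranked order means the most important group is read first.
task_queue = deque(groups)
```

With these inputs the scheduler ends up with 6 groups of 50 keywords each, and the group at the front of the queue holds the 50 most important keywords.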
The plurality of servers of the embodiment of the present application respectively obtain keyword groups from the task queue and carry out the webpage crawling task. As shown in Fig. 2, the scheduler divides all keywords into N groups, namely keyword group 1 to keyword group N, and three servers (server 1, server 2 and server 3) obtain keyword groups from the task queue in turn. For example, server 1 obtains keyword group 1, server 2 obtains keyword group 2, and server 3 obtains keyword group 3. While processing the crawling task of the keyword group it has obtained, each server may record the state of the current keyword group, for example its task status (such as crawl succeeded, crawl failed, waiting, or crawling) and its retry count (for example, the initial value may be set to 0 and incremented by 1 each time crawling is retried after a failure). After finishing the crawling task for every keyword in the keyword group it obtained, a server obtains a new keyword group from the task queue for processing, and so on.
It should be noted that the embodiment of the present application may merge identical keywords among the plurality of keywords to reduce the total amount to be crawled. The embodiment of the present application may also use multiple task queues to match different crawling strategies (for example, different search engines or different crawl limits). For example, task queue 1 stores keywords to be searched on Baidu, and task queue 2 stores keywords to be searched on Sogou.
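A minimal sketch of the two ideas in this paragraph, merging identical keywords and keeping one task queue per search engine, might look like the following. The helper names `merge_duplicates` and `enqueue` and the dictionary keys are hypothetical.

```python
from collections import deque

def merge_duplicates(keywords):
    """Drop repeated keywords while keeping their first-seen order."""
    return list(dict.fromkeys(keywords))

# One in-memory queue per search engine (an assumption for illustration).
task_queues = {"baidu": deque(), "sogou": deque()}

def enqueue(engine, keyword_group):
    """Store a de-duplicated keyword group in the queue for the given engine."""
    task_queues[engine].append(merge_duplicates(keyword_group))

enqueue("baidu", ["shoes", "boots", "shoes"])
enqueue("sogou", ["hats"])
```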
Optionally, the plurality of servers include a first server, and the plurality of servers respectively obtaining keyword groups from the task queue includes the first server obtaining a keyword group from the task queue, which includes: the first server detecting whether a keyword group to be crawled exists in the task queue; the first server locking the task queue upon detecting that a keyword group to be crawled exists in the task queue, wherein the locked task queue can be read only by the first server; and the first server obtaining the keyword group from the locked task queue and releasing the task queue, wherein the released task queue can be read by any one of the plurality of servers.
The first server of the embodiment of the present application may be any one of the plurality of servers. Specifically, the first server first detects whether a keyword group to be crawled exists in the task queue. If no keyword group to be crawled is detected in the task queue, the first server waits for a preset time and then detects again. If a keyword group to be crawled is detected in the task queue, the first server locks the task queue so that the other servers cannot access it at that moment, which avoids conflicts caused by multiple threads reading the task queue at the same time. After reading the keyword group from the task queue, the first server releases (unlocks) the task queue so that all servers can access it again.
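The detect, lock, fetch and release sequence can be sketched with threads standing in for servers. This is an assumption-laden illustration: `threading.Lock` only protects a queue inside one process, so a real multi-server deployment would need a distributed lock or a database-level lock, which the application does not specify. The sketch also performs the existence check while holding the lock, a small deviation from the order in the text that avoids a race between detecting and locking. The names `fetch_group`, `poll_interval` and `max_polls` are hypothetical.

```python
import threading
import time
from collections import deque

queue_lock = threading.Lock()
task_queue = deque([["kw1", "kw2"], ["kw3"]])

def fetch_group(poll_interval=0.01, max_polls=3):
    """Detect, lock, fetch and release; return None if the queue stays empty."""
    for _ in range(max_polls):
        with queue_lock:                     # lock: only this caller reads the queue
            if task_queue:                   # detect a keyword group to be crawled
                return task_queue.popleft()  # fetch; the lock is released on exit
        time.sleep(poll_interval)            # wait a preset time, then detect again
    return None
```

Each caller receives a different keyword group, and a caller that finds the queue empty after the configured number of polls gets `None` back instead of blocking forever.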
Step S104: the plurality of servers respectively crawl, through their respective web crawlers, the search engine result page corresponding to each keyword in the obtained keyword groups.
Each of the plurality of servers in the embodiment of the present application is provided with a web crawler. After the servers have respectively obtained keyword groups, they crawl, through their respective web crawlers, the search engine result page corresponding to each keyword in the obtained keyword groups, wherein the search engine result page corresponding to a keyword is the page of search results displayed after the keyword is entered into a search engine (for example, Baidu or Sogou). Taking Fig. 2 as an example, after server 1 obtains keyword group 1, server 1 crawls the search engine result page corresponding to each keyword in keyword group 1 through its web crawler; server 2 and server 3 crawl webpages in the same way as server 1.
By having multiple servers process the crawling tasks for the search engine result pages of many keywords in a distributed manner, the embodiment of the present application can, on the one hand, improve the efficiency of crawling those pages and, on the other hand, reduce the likelihood of triggering the search engine's anti-crawler policy when the number of keywords is very large, so as to ensure that the search engine result pages corresponding to all keywords can be obtained.
In the embodiment of the present application, a plurality of servers respectively obtain keyword groups from a task queue, wherein the task queue stores a plurality of keyword groups to be crawled and each keyword group to be crawled includes a plurality of keywords, and the plurality of servers respectively crawl, through their respective web crawlers, the search engine result page corresponding to each keyword in the obtained keyword groups. Because the search engine result pages corresponding to the keywords are crawled in a distributed manner by multiple servers, crawling efficiency can be improved and the likelihood of triggering the search engine's anti-crawler policy can be reduced. This solves the problem in the related art that crawling keyword search engine result pages through the web crawler of a single server is inefficient, and thereby achieves the effect of improving the efficiency of crawling the search engine result pages corresponding to keywords.
Preferably, to avoid missing keywords for which crawling failed, the plurality of servers include a first server, the keyword group obtained by the first server is a first keyword group, and the web crawler of the first server is a first web crawler. The plurality of servers respectively crawling the search engine result pages through their respective web crawlers includes the first server crawling, through the first web crawler, the search engine result page corresponding to each keyword in the first keyword group, which includes: traversing the first keyword group and crawling, through the first web crawler, the search engine result page corresponding to each keyword in the first keyword group; determining whether the first web crawler succeeded in crawling the search engine result page corresponding to each keyword in the first keyword group; and, upon determining that crawling failed for the search engine result page of a keyword in the first keyword group, adding the keyword for which crawling failed to a failure list.
The embodiment of the present application is described taking the first server as an example. Specifically, the first server traverses each keyword in the first keyword group and crawls the search engine result page corresponding to each keyword through its web crawler (the first web crawler). However, network exceptions, server exceptions, data parsing exceptions, triggering of the anti-crawler policy and similar causes may make the web crawler fail to crawl the search engine result page corresponding to a keyword, that is, fail to obtain that page. The embodiment of the present application therefore checks the crawling results of the web crawler. If the web crawler successfully crawls the search engine result page corresponding to every keyword in the keyword group, the processing state of the keyword group is recorded as success. If crawling fails for the search engine result page of one or more keywords in the keyword group, the keywords for which crawling failed are added to the failure list so that they are marked, the processing state of the keyword group is recorded as failure, and the retry count is incremented by 1. By recording the keywords for which crawling failed, the embodiment of the present application avoids missing them.
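The traverse, check and record loop described above can be sketched as follows. The function `crawl_group` and the stand-in `fake_fetch` are hypothetical; a real crawler would issue an HTTP request to the search engine, and the sketch assumes that any failure (network, server, parsing or anti-crawler) surfaces as a raised exception.

```python
def crawl_group(keyword_group, fetch_results_page):
    """Crawl each keyword; collect result pages and a list of failed keywords."""
    results, failed = {}, []
    for keyword in keyword_group:
        try:
            results[keyword] = fetch_results_page(keyword)
        except Exception:
            # Any failure puts the keyword on the failure list so it is not missed.
            failed.append(keyword)
    status = "failed" if failed else "success"
    return results, failed, status

def fake_fetch(keyword):
    """Hypothetical stand-in for a crawler request."""
    if keyword == "blocked":
        raise RuntimeError("anti-crawler policy triggered")
    return f"<html>results for {keyword}</html>"

results, failed, status = crawl_group(["shoes", "blocked"], fake_fetch)
```

In this run the keyword "shoes" is stored in `results`, "blocked" ends up in `failed`, and the group as a whole is marked as failed.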
It should be noted that when a server fails to crawl the search engine result page corresponding to a keyword, the server may be put to sleep for a preset time and then execute the crawling task again.
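A sleep-then-retry step of this kind could be sketched as below. The function name `crawl_with_pause`, the default pause and the number of attempts are assumptions made for illustration; the application only says that the server sleeps for a preset time before re-executing the task.

```python
import time

def crawl_with_pause(keyword, fetch_results_page, pause_seconds=0.01, attempts=2):
    """Try to crawl a keyword; after a failure, sleep and then try again."""
    for attempt in range(attempts):
        try:
            return fetch_results_page(keyword)
        except Exception:
            if attempt + 1 < attempts:
                time.sleep(pause_seconds)  # sleep for the preset time
    return None  # every attempt failed

# Hypothetical fetcher that fails once and then succeeds.
calls = []

def flaky_fetch(keyword):
    calls.append(keyword)
    if len(calls) == 1:
        raise RuntimeError("temporary failure")
    return f"results for {keyword}"

page = crawl_with_pause("shoes", flaky_fetch)
```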
Preferably, after the keyword for which crawling failed in the first keyword group is added to the failure list, the method also includes: packaging the keywords in the failure list into a new keyword group; and adding the new keyword group to the task queue.
After adding the keywords for which crawling failed to the failure list, the embodiment of the present application obtains those keywords from the failure list, repackages them into a new keyword group and stores that group in the task queue, so that the failed keywords can be crawled again. This avoids missing the data corresponding to keywords for which crawling failed, that is, it ensures that the search engine result pages corresponding to all keywords are crawled.
Preferably, packaging the keywords in the failure list into a new keyword group includes: obtaining the retry count of the keywords in the failure list; determining whether the retry count of the keywords in the failure list is less than a preset value; and, upon determining that the retry count of the keywords in the failure list is less than the preset value, packaging the keywords in the failure list into the new keyword group.
In practice, some keywords may still fail to yield their search engine result pages after being crawled repeatedly. To save system resources, the crawling task for these keywords can be stopped, and their search engine result pages can be obtained by other means, for example manually.
Specifically, because the present application records the retry count (that is, the number of failures) of each keyword group in advance, it obtains the retry count of the keywords in the failure list and compares it with the preset value. If the retry count is less than the preset value, the keywords in the failure list are packaged and stored in the task queue; if the retry count is greater than the preset value, no packaging is performed.
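The retry-limited repackaging can be sketched as follows. Several details here are assumptions: the retry count is tracked per keyword in a dictionary, the preset value is 3, and a count equal to the preset value is treated as exhausted (the text only covers the less-than and greater-than cases). The name `repackage_failures` is hypothetical.

```python
from collections import deque

MAX_RETRIES = 3  # hypothetical preset value

def repackage_failures(failed, retry_counts, task_queue, max_retries=MAX_RETRIES):
    """Repackage failed keywords whose retry count is still below the preset value."""
    retryable = [kw for kw in failed if retry_counts.get(kw, 0) < max_retries]
    if retryable:
        task_queue.append(retryable)  # the new keyword group goes back on the queue
    return retryable

task_queue = deque()
requeued = repackage_failures(
    failed=["shoes", "blocked"],
    retry_counts={"shoes": 1, "blocked": 3},
    task_queue=task_queue,
)
```

Here "shoes" is re-queued because it has been retried only once, while "blocked" has reached the limit and is left for another means of collection, such as manual lookup.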
As can be seen from the above description, for a large number of keywords, the embodiment of the present application breaks the whole into parts and distributes them to machines (servers) with different IP addresses, which achieves distributed crawling and at the same time reduces the likelihood of triggering anti-crawler measures. When crawling fails (often because the anti-crawler mechanism was triggered), the keywords are regrouped and crawled again, which ensures that data is obtained for every keyword and none is missed. The keywords are grouped and added to a crawl queue, which may take the form of a distributed queue component or of a database. The crawlers actively request tasks, so the crawling speed can be controlled according to actual conditions. Identical keywords can be merged to reduce the total amount to be crawled.
It should be noted that the steps shown in the flowcharts of the drawings may be executed in a computer system, for example as a set of computer-executable instructions, and that although a logical order is shown in the flowcharts, in some cases the steps shown or described may be executed in an order different from the one given here.
According to the another aspect of the embodiment of the present application, there is provided a kind of web page crawl device, the web page crawl device can be with For performing the web page crawl method of the embodiment of the present application, the web page crawl method of the embodiment of the present application can also be by this The web page crawl device of application embodiment is performing.
Fig. 3 is the schematic diagram of the web page crawl device according to the embodiment of the present application, as shown in figure 3, the device includes: Acquiring unit 10 and crawl unit 20.
Acquiring unit 10, for making multiple servers obtain crucial phrase from task queue respectively, wherein, task team Be stored with multiple crucial phrases to be crawled in row, and each crucial phrase to be crawled includes multiple keywords.
Alternatively, multiple servers include first server, and acquiring unit 10 includes:Detection module, for making first Whether there is crucial phrase to be crawled in server Detection task queue;Locking module, for making first server exist Detect and exist in task queue when the crucial phrase for crawling, lock task queue, wherein, the task queue of locking Only can be read by first server;And acquisition module, for making first server obtain from the task queue of locking Crucial phrase is taken, and discharges task queue, wherein, the task queue after release can be by any one in multiple servers Platform server reads.
Unit 20 is crawled, for making multiple servers crawl by respective web crawlers respectively in the crucial phrase of acquisition The corresponding search-engine results page of each keyword.
Alternatively, multiple servers include first server, and the crucial phrase that first server is obtained is the first keyword Group, the web crawlers of first server is first network reptile, and crawling unit 20 includes:Module is crawled, for traveling through First crucial phrase, by first network reptile the corresponding search engine knot of each keyword in the first crucial phrase is crawled Fruit page;Judge module, judges that first network reptile crawls the corresponding search engine of each keyword in the first crucial phrase Whether result page is successful;And add module, for judging there is the keyword pair crawled in the first crucial phrase During the situation of the search-engine results page failure answered, the keyword that failure is crawled in the first crucial phrase is added to failure List.
The embodiment of the present application makes multiple servers obtain crucial phrase from task queue respectively by acquiring unit 10, its In, be stored with multiple crucial phrases to be crawled in task queue, and each crucial phrase to be crawled includes multiple keys Word;And crawling unit 20 makes multiple servers crawl by respective web crawlers respectively in the crucial phrase of acquisition often The corresponding search-engine results page of individual keyword, the embodiment of the present application crawls in a distributed manner keyword by multiple servers Corresponding search-engine results page, such that it is able to improve the efficiency for crawling the corresponding search-engine results page of keyword, Can reduce triggering the possibility of the anti-reptile strategy of search engine, solve in correlation technique by the net of single server Network reptile crawls problem less efficient during keyword search engine result page, and then has reached raising and crawl keyword pair The efficiency effect of the search-engine results page answered.
Preferably, the device further includes: a packaging unit, configured to package the keywords in the failure list into a new keyword group; and an adding unit, configured to add the new keyword group to the task queue.
The webpage crawling device includes a processor and a memory. The acquiring unit, the crawling unit, and the other units described above are stored in the memory as program units, and the processor executes the program units stored in the memory to implement the corresponding functions.
The processor contains a kernel, and the kernel retrieves the corresponding program unit from the memory. One or more kernels may be provided, and the search engine results pages corresponding to keywords are crawled by adjusting kernel parameters.
The memory may take forms such as volatile memory in a computer-readable medium, random access memory (RAM), and/or non-volatile memory, for example read-only memory (ROM) or flash memory (flash RAM). The memory includes at least one memory chip.
The present application further provides a computer program product which, when executed on a data processing device, is adapted to execute program code that initializes the following method steps: multiple servers each acquire a keyword group from a task queue, where the task queue stores multiple keyword groups to be crawled and each keyword group to be crawled includes multiple keywords; and the multiple servers each crawl, through their respective web crawlers, the search engine results page corresponding to each keyword in the acquired keyword group.
The serial numbers of the above embodiments of the present application are for description only and do not indicate the relative merits of the embodiments.
In the above embodiments of the present application, the description of each embodiment has its own emphasis. For parts not described in detail in one embodiment, refer to the related descriptions of the other embodiments.
In the several embodiments provided in the present application, it should be understood that the disclosed technical content may be implemented in other ways. The device embodiments described above are merely illustrative. For example, the division into units may be a division by logical function, and other divisions are possible in an actual implementation: multiple units or components may be combined or integrated into another system, or some features may be omitted or not performed. In addition, the mutual coupling, direct coupling, or communication connection shown or discussed may be indirect coupling or communication connection through some interfaces, units, or modules, and may be electrical or take other forms.
The units described as separate components may or may not be physically separate, and the components shown as units may or may not be physical units; that is, they may be located in one place or distributed across multiple units. Some or all of the units may be selected according to actual needs to achieve the purpose of the solution of this embodiment.
In addition, the functional units in the embodiments of the present application may be integrated into one processing unit, each unit may exist physically on its own, or two or more units may be integrated into one unit. The integrated unit may be implemented in the form of hardware or in the form of a software functional unit.
If the integrated unit is implemented in the form of a software functional unit and sold or used as an independent product, it may be stored in a computer-readable storage medium. Based on this understanding, the technical solution of the present application in essence, or the part of it that contributes to the prior art, or all or part of the technical solution, may be embodied in the form of a software product. The computer software product is stored in a storage medium and includes a number of instructions that cause a computer device (which may be a personal computer, a server, a network device, or the like) to perform all or some of the steps of the methods described in the embodiments of the present application. The aforementioned storage medium includes various media that can store program code, such as a USB flash drive, read-only memory (ROM), random access memory (RAM), a removable hard disk, a magnetic disk, or an optical disc.
The above are only preferred embodiments of the present application. It should be noted that a person of ordinary skill in the art may make improvements and refinements without departing from the principles of the present application, and such improvements and refinements shall also be regarded as falling within the protection scope of the present application.

Claims (10)

1. A webpage crawling method, characterized by comprising:
multiple servers each acquiring a keyword group from a task queue, wherein the task queue stores multiple keyword groups to be crawled, and each keyword group to be crawled includes multiple keywords; and
the multiple servers each crawling, through their respective web crawlers, the search engine results page corresponding to each keyword in the acquired keyword group.
2. The method according to claim 1, characterized in that the multiple servers include a first server, the multiple servers each acquiring a keyword group from the task queue includes the first server acquiring a keyword group from the task queue, and the first server acquiring a keyword group from the task queue comprises:
the first server detecting whether a keyword group to be crawled exists in the task queue;
the first server locking the task queue upon detecting that a keyword group to be crawled exists in the task queue, wherein the locked task queue can be read only by the first server; and
the first server acquiring the keyword group from the locked task queue and releasing the task queue, wherein the released task queue can be read by any one of the multiple servers.
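The detect, lock, acquire, and release sequence of claim 2 can be sketched in Python as below. This is only an in-process analogy: `LockedTaskQueue` and its methods are hypothetical names, and a `threading.Lock` stands in for whatever cross-server locking mechanism a real deployment would need, which the claim does not specify.

```python
import threading
from collections import deque
from typing import Deque, Iterable, List, Optional


class LockedTaskQueue:
    """Task queue that only the lock holder may read while it is locked."""

    def __init__(self, keyword_groups: Iterable[List[str]]) -> None:
        self._groups: Deque[List[str]] = deque(keyword_groups)
        self._lock = threading.Lock()

    def has_pending(self) -> bool:
        """Detection step: is there a keyword group waiting to be crawled?"""
        return len(self._groups) > 0

    def acquire_group(self) -> Optional[List[str]]:
        """Detect, lock, take one keyword group, then release the queue."""
        if not self.has_pending():       # detect
            return None
        with self._lock:                 # lock: other callers block here
            if not self._groups:         # re-check: the last group may be gone
                return None
            return self._groups.popleft()  # acquire; lock released on exit


task_queue = LockedTaskQueue([["a", "b"], ["c"]])
first_group = task_queue.acquire_group()
```

The re-check inside the lock is an addition of this sketch; it covers the case where another server takes the last group between the detection step and the locking step.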
3. The method according to claim 1, characterized in that the multiple servers include a first server, the keyword group acquired by the first server is a first keyword group, the web crawler of the first server is a first web crawler, the multiple servers each crawling, through their respective web crawlers, the search engine results page corresponding to each keyword in the acquired keyword group includes the first server crawling, through the first web crawler, the search engine results page corresponding to each keyword in the first keyword group, and the first server crawling, through the first web crawler, the search engine results page corresponding to each keyword in the first keyword group comprises:
traversing the first keyword group, and crawling, through the first web crawler, the search engine results page corresponding to each keyword in the first keyword group;
judging whether the first web crawler succeeded in crawling the search engine results page corresponding to each keyword in the first keyword group; and
when it is judged that crawling the search engine results page corresponding to some keyword in the first keyword group has failed, adding the keywords in the first keyword group that failed to be crawled to a failure list.
4. The method according to claim 3, characterized in that, after the keywords in the first keyword group that failed to be crawled are added to the failure list, the method further comprises:
packaging the keywords in the failure list into a new keyword group; and
adding the new keyword group to the task queue.
5. The method according to claim 4, characterized in that packaging the keywords in the failure list into a new keyword group comprises:
acquiring the number of retries of the keywords in the failure list;
judging whether the number of retries of the keywords in the failure list is less than a preset value; and
when it is judged that the number of retries of the keywords in the failure list is less than the preset value, packaging the keywords in the failure list into a new keyword group.
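The retry check of claim 5 can be sketched as follows. The claim does not say whether retries are counted per keyword or per list, nor when the count is incremented; this sketch assumes a per-keyword counter that is incremented each time a keyword is repackaged. `package_failed_keywords`, `retry_counts`, and `max_retries` are hypothetical names.

```python
from typing import Dict, List


def package_failed_keywords(
    failure_list: List[str],
    retry_counts: Dict[str, int],
    max_retries: int,
) -> List[str]:
    """Build a new keyword group from failed keywords still under the limit."""
    new_group: List[str] = []
    for keyword in failure_list:
        retries = retry_counts.get(keyword, 0)   # acquire the retry count
        if retries < max_retries:                # compare with the preset value
            retry_counts[keyword] = retries + 1  # assumed bookkeeping step
            new_group.append(keyword)
    return new_group


retry_counts = {"a": 0, "b": 3}
new_group = package_failed_keywords(["a", "b", "c"], retry_counts, max_retries=3)
```

Keywords that have reached the limit are left out of the new group, which prevents a keyword that always fails from circulating in the task queue indefinitely.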
6. The method according to claim 1, characterized in that, before the multiple servers each acquire a keyword group from the task queue, the method further comprises:
grouping multiple keywords according to a preset rule to obtain multiple keyword groups; and
storing the multiple keyword groups in the task queue according to priority.
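The grouping and priority-ordered storage of claim 6 can be sketched as below. The claim leaves both the preset rule and the priority scheme open, so this sketch assumes fixed-size chunks as the rule and integer priorities where a lower number is served first; `group_keywords` and `PriorityTaskQueue` are hypothetical names.

```python
import heapq
from itertools import count
from typing import List, Tuple


def group_keywords(keywords: List[str], group_size: int) -> List[List[str]]:
    """Assumed preset rule: split the keywords into fixed-size groups."""
    return [keywords[i:i + group_size] for i in range(0, len(keywords), group_size)]


class PriorityTaskQueue:
    """Task queue that hands out keyword groups in priority order."""

    def __init__(self) -> None:
        self._heap: List[Tuple[int, int, List[str]]] = []
        self._sequence = count()  # tie-breaker keeps insertion order stable

    def put(self, group: List[str], priority: int) -> None:
        heapq.heappush(self._heap, (priority, next(self._sequence), group))

    def get(self) -> List[str]:
        return heapq.heappop(self._heap)[2]

    def __len__(self) -> int:
        return len(self._heap)


groups = group_keywords(["a", "b", "c", "d", "e"], 2)
task_queue = PriorityTaskQueue()
for group, priority in zip(groups, [2, 1, 2]):
    task_queue.put(group, priority)
```

The sequence counter ensures that groups with equal priority come out in the order they were stored, without comparing the keyword lists themselves.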
7. A webpage crawling device, characterized by comprising:
an acquiring unit, configured to cause multiple servers to each acquire a keyword group from a task queue, wherein the task queue stores multiple keyword groups to be crawled, and each keyword group to be crawled includes multiple keywords; and
a crawling unit, configured to cause the multiple servers to each crawl, through their respective web crawlers, the search engine results page corresponding to each keyword in the acquired keyword group.
8. The device according to claim 7, characterized in that the multiple servers include a first server, and the acquiring unit comprises:
a detecting module, configured to cause the first server to detect whether a keyword group to be crawled exists in the task queue;
a locking module, configured to cause the first server to lock the task queue upon detecting that a keyword group to be crawled exists in the task queue, wherein the locked task queue can be read only by the first server; and
an acquiring module, configured to cause the first server to acquire the keyword group from the locked task queue and release the task queue, wherein the released task queue can be read by any one of the multiple servers.
9. The device according to claim 7, characterized in that the multiple servers include a first server, the keyword group acquired by the first server is a first keyword group, the web crawler of the first server is a first web crawler, and the crawling unit comprises:
a crawling module, configured to traverse the first keyword group and crawl, through the first web crawler, the search engine results page corresponding to each keyword in the first keyword group;
a judging module, configured to judge whether the first web crawler succeeded in crawling the search engine results page corresponding to each keyword in the first keyword group; and
an adding module, configured to add the keywords in the first keyword group that failed to be crawled to a failure list when it is judged that crawling the search engine results page corresponding to some keyword in the first keyword group has failed.
10. The device according to claim 9, characterized in that the device further comprises:
a packaging unit, configured to package the keywords in the failure list into a new keyword group; and
an adding unit, configured to add the new keyword group to the task queue.
CN201510729544.6A 2015-10-30 2015-10-30 Webpage crawling method and device Active CN106649362B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201510729544.6A CN106649362B (en) 2015-10-30 2015-10-30 Webpage crawling method and device

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201510729544.6A CN106649362B (en) 2015-10-30 2015-10-30 Webpage crawling method and device

Publications (2)

Publication Number Publication Date
CN106649362A true CN106649362A (en) 2017-05-10
CN106649362B CN106649362B (en) 2020-02-07

Family

ID=58809462

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201510729544.6A Active CN106649362B (en) 2015-10-30 2015-10-30 Webpage crawling method and device

Country Status (1)

Country Link
CN (1) CN106649362B (en)

Patent Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101334784A (en) * 2008-07-30 2008-12-31 施章祖 Computer auxiliary report and knowledge base generation method
CN101788988A (en) * 2009-01-22 2010-07-28 蔡亮华 Information extraction method
CN101916291A (en) * 2010-08-26 2010-12-15 北京大学 Method for crawling eDonkey network shared file and client information
WO2015039165A1 (en) * 2013-09-19 2015-03-26 Longtail Ux Pty Ltd Improvements in website traffic optimization
CN103605764A (en) * 2013-11-26 2014-02-26 Tcl集团股份有限公司 Web crawler system and web crawler multitask executing and scheduling method
CN104077402A (en) * 2014-07-04 2014-10-01 用友软件股份有限公司 Data processing method and data processing system
CN104199830A (en) * 2014-07-31 2014-12-10 渠成 Search engine optimization big data management platform

Cited By (15)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107133217A (en) * 2017-05-26 2017-09-05 北京惠商之星网络科技有限公司 Target topic intelligent grabbing method, system and computer-readable recording medium
CN110020041A (en) * 2017-08-21 2019-07-16 北京国双科技有限公司 A kind of method and device tracking the process that crawls
CN110147473B (en) * 2017-08-28 2022-03-01 北京国双科技有限公司 Crawling method and device for crawler
CN110147473A (en) * 2017-08-28 2019-08-20 北京国双科技有限公司 A kind of crawling method and device of crawler
CN108170551A (en) * 2018-01-03 2018-06-15 深圳壹账通智能科技有限公司 Front and back end error handling method, server and storage medium based on crawler system
CN109657462A (en) * 2018-12-06 2019-04-19 江苏满运软件科技有限公司 Data detection method, system, electronic equipment and storage medium
CN109815380A (en) * 2018-12-20 2019-05-28 山东中创软件工程股份有限公司 A kind of information crawler method, apparatus, equipment and computer readable storage medium
CN110287444A (en) * 2019-07-02 2019-09-27 郑州悉知信息科技股份有限公司 Website detection method, device and storage medium
CN110287444B (en) * 2019-07-02 2021-06-25 郑州悉知信息科技股份有限公司 Website detection method and device and storage medium
CN110928711A (en) * 2019-11-26 2020-03-27 多点(深圳)数字科技有限公司 Task processing method, device, system, server and storage medium
US11941073B2 (en) 2019-12-23 2024-03-26 97th Floor Generating and implementing keyword clusters
CN111460254A (en) * 2020-03-24 2020-07-28 南阳柯丽尔科技有限公司 Webpage crawling method, device, storage medium and equipment based on multithreading
CN111460254B (en) * 2020-03-24 2023-05-05 南阳柯丽尔科技有限公司 Webpage crawling method and device based on multithreading, storage medium and equipment
CN113239253A (en) * 2021-04-09 2021-08-10 北京皮尔布莱尼软件有限公司 Web crawler implementation method, system, computing device and storage medium
CN113239253B (en) * 2021-04-09 2024-02-23 北京皮尔布莱尼软件有限公司 Method, system, computing device and storage medium for realizing web crawler

Also Published As

Publication number Publication date
CN106649362B (en) 2020-02-07

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
CB02 Change of applicant information

Address after: 100083 No. 401, 4th Floor, Haitai Building, 229 North Fourth Ring Road, Haidian District, Beijing

Applicant after: Beijing Guoshuang Technology Co.,Ltd.

Address before: 100086 Cuigong Hotel, 76 Zhichun Road, Shuangyushu District, Haidian District, Beijing

Applicant before: Beijing Guoshuang Technology Co.,Ltd.

GR01 Patent grant