CN106649362A - Webpage crawling method and apparatus - Google Patents
- Publication number
- CN106649362A, CN201510729544.6A
- Authority
- CN
- China
- Prior art keywords
- keyword
- keyword set
- task queue
- server
- crawled
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Granted
Classifications
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/90—Details of database functions independent of the retrieved data types
- G06F16/95—Retrieval from the web
- G06F16/951—Indexing; Web crawling techniques
Landscapes
- Engineering & Computer Science (AREA)
- Databases & Information Systems (AREA)
- Theoretical Computer Science (AREA)
- Data Mining & Analysis (AREA)
- Physics & Mathematics (AREA)
- General Engineering & Computer Science (AREA)
- General Physics & Mathematics (AREA)
- Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
Abstract
The invention discloses a webpage crawling method and apparatus. The method comprises the steps that a plurality of servers obtain keyword sets from a task queue, wherein the task queue stores a plurality of to-be-crawled keyword sets, and each to-be-crawled keyword set contains a plurality of keywords; and the servers crawl search engine result pages corresponding to all the keywords in the obtained keyword sets through respective network crawlers. According to the method and the apparatus, the technical problem of relatively low efficiency of crawling keyword search engine result pages through a network crawler of a single server in related technologies is solved.
Description
Technical field
This application relates to the Internet field, and in particular to a webpage crawling method and apparatus.
Background
In traditional search engine optimization (Search Engine Optimization, SEO for short) business, it is usually necessary to help customers analyze how their keywords rank in search engines. Typically, a user presets a set of keywords, and a web crawler periodically crawls the ranking of these keywords in a search engine; that is, the web crawler crawls the search engine result page corresponding to each keyword. The search engine result page corresponding to a keyword is the page of search results displayed after the keyword is entered into a search engine (for example, Baidu or Sogou).
However, to block access by robots (for example, web crawlers) or to reduce abnormal traffic, search engines often limit the search rate or the number of searches allowed from a single IP address (an anti-crawler policy). The keywords specified by users can also add up to a very large number. Crawling keyword search engine result pages from a single machine or a single IP address is therefore inefficient, and the search engine's anti-crawler policy can easily make it impossible to crawl the result pages for all keywords.
No effective solution has yet been proposed for the low efficiency of crawling keyword search engine result pages with the web crawler of a single server in the related art.
Summary of the invention
The main purpose of this application is to provide a webpage crawling method and apparatus that solve the problem in the related art of low efficiency when the web crawler of a single server crawls keyword search engine result pages.
To achieve the above purpose, according to one aspect of this application, a webpage crawling method is provided. The method includes: multiple servers each obtain a keyword set from a task queue, where the task queue stores multiple keyword sets to be crawled and each keyword set to be crawled contains multiple keywords; and the servers each crawl, through their respective web crawlers, the search engine result page corresponding to each keyword in the obtained keyword set.
Further, the multiple servers include a first server, and the step in which the multiple servers each obtain a keyword set from the task queue includes the first server obtaining a keyword set from the task queue. This includes: the first server detects whether the task queue contains a keyword set to be crawled; when it detects that such a keyword set exists, the first server locks the task queue, where a locked task queue can be read only by the first server; and the first server obtains a keyword set from the locked task queue and releases the task queue, where a released task queue can be read by any of the multiple servers.
Further, the multiple servers include a first server, the keyword set obtained by the first server is a first keyword set, and the web crawler of the first server is a first web crawler. The step in which the multiple servers each crawl, through their respective web crawlers, the search engine result page corresponding to each keyword in the obtained keyword set includes the first server crawling, through the first web crawler, the search engine result page corresponding to each keyword in the first keyword set. This includes: traversing the first keyword set and crawling, through the first web crawler, the search engine result page corresponding to each keyword in it; judging whether the first web crawler succeeded in crawling the search engine result page corresponding to each keyword in the first keyword set; and, when crawling the result page for any keyword in the first keyword set is judged to have failed, adding the keywords that failed to a failure list.
Further, after the keywords in the first keyword set that failed to be crawled are added to the failure list, the method also includes: packaging the keywords in the failure list into a new keyword set; and adding the new keyword set to the task queue.
Further, packaging the keywords in the failure list into a new keyword set includes: obtaining the retry count of the keywords in the failure list; judging whether that retry count is less than a preset value; and, when the retry count is judged to be less than the preset value, packaging the keywords in the failure list into a new keyword set.
Further, before the multiple servers each obtain a keyword set from the task queue, the method also includes: grouping multiple keywords according to a preset rule to obtain multiple keyword sets; and storing the keyword sets in the task queue according to priority.
To achieve the above purpose, according to another aspect of this application, a webpage crawling apparatus is provided. The apparatus includes: an acquiring unit, configured to make multiple servers each obtain a keyword set from a task queue, where the task queue stores multiple keyword sets to be crawled and each keyword set to be crawled contains multiple keywords; and a crawling unit, configured to make the servers each crawl, through their respective web crawlers, the search engine result page corresponding to each keyword in the obtained keyword set.
Further, the multiple servers include a first server, and the acquiring unit includes: a detection module, configured to make the first server detect whether the task queue contains a keyword set to be crawled; a locking module, configured to make the first server lock the task queue when it detects that such a keyword set exists, where a locked task queue can be read only by the first server; and an acquisition module, configured to make the first server obtain a keyword set from the locked task queue and release the task queue, where a released task queue can be read by any of the multiple servers.
Further, the multiple servers include a first server, the keyword set obtained by the first server is a first keyword set, and the web crawler of the first server is a first web crawler. The crawling unit includes: a crawling module, configured to traverse the first keyword set and crawl, through the first web crawler, the search engine result page corresponding to each keyword in it; a judging module, configured to judge whether the first web crawler succeeded in crawling the search engine result page corresponding to each keyword in the first keyword set; and an adding module, configured to add the keywords that failed to a failure list when crawling the result page for any keyword in the first keyword set is judged to have failed.
Further, the apparatus also includes: a packaging unit, configured to package the keywords in the failure list into a new keyword set; and an adding unit, configured to add the new keyword set to the task queue.
In this application, multiple servers each obtain a keyword set from a task queue, where the task queue stores multiple keyword sets to be crawled and each keyword set to be crawled contains multiple keywords, and the servers each crawl, through their respective web crawlers, the search engine result page corresponding to each keyword in the obtained keyword set. Because the result pages are crawled by multiple servers in a distributed manner, crawling efficiency improves and the likelihood of triggering a search engine's anti-crawler policy falls. This solves the problem in the related art of low efficiency when the web crawler of a single server crawls keyword search engine result pages, and thereby improves the efficiency of crawling the search engine result pages corresponding to keywords.
Brief description of the drawings
The accompanying drawings, which form part of this application, are provided for a further understanding of this application. The illustrative embodiments of this application and their descriptions are used to explain this application and do not unduly limit it. In the drawings:
Fig. 1 is a flowchart of the webpage crawling method according to an embodiment of this application;
Fig. 2 is a schematic diagram of distributed webpage crawling according to an embodiment of this application; and
Fig. 3 is a schematic diagram of the webpage crawling apparatus according to an embodiment of this application.
Detailed description
It should be noted that, where no conflict arises, the embodiments of this application and the features in them may be combined with one another. This application is described in detail below with reference to the drawings and in conjunction with the embodiments.
To help those skilled in the art better understand the solution of this application, the technical solutions in the embodiments of this application are described clearly and completely below with reference to the drawings of the embodiments. The described embodiments are evidently only some of the embodiments of this application, not all of them. All other embodiments obtained by a person of ordinary skill in the art on the basis of the embodiments in this application, without creative effort, shall fall within the scope of protection of this application.
It should be noted that the terms "first", "second", and so on in the description, the claims, and the above drawings of this application are used to distinguish similar objects and are not necessarily used to describe a particular order or sequence. It should be understood that data used in this way are interchangeable where appropriate, so that the embodiments of this application described here can be implemented. In addition, the terms "include" and "have" and any variations of them are intended to cover non-exclusive inclusion; for example, a process, method, system, product, or device that contains a series of steps or units is not necessarily limited to the steps or units expressly listed, but may include other steps or units that are not expressly listed or that are inherent to the process, method, product, or device.
For ease of description, some concepts involved in this application are explained below:
Search engine optimization (Search Engine Optimization, SEO for short): a way of improving the ranking of a target website in the relevant search engines by making use of the search engines' ranking rules.
Queue: a special kind of linear list that allows deletion only at the front of the list and insertion only at the rear. It is also called a first in, first out (First In First Out) linear list, or FIFO list for short. The queue in the embodiments of this application may take the form of a distributed queue component or of a database.
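As a minimal illustration of the first in, first out behavior described above, the following Python sketch uses the standard library `collections.deque` as an in-memory stand-in for the task queue. The names `task_queue` and the `"set-1"` and `"set-2"` entries are hypothetical; the application itself only says the queue may be a distributed queue component or a database.

```python
from collections import deque

# In-memory stand-in for the task queue (an assumption for illustration).
task_queue = deque()

# Insertion happens at the rear of the queue.
task_queue.append("set-1")
task_queue.append("set-2")

# Deletion happens at the front, so the earliest entry leaves first.
first_out = task_queue.popleft()
```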
Web crawler: also known as a web spider or web robot, a program or script that automatically fetches World Wide Web information according to preset rules.
According to an embodiment of this application, a webpage crawling method is provided. Fig. 1 is a flowchart of the webpage crawling method according to an embodiment of this application. As shown in Fig. 1, the method includes steps S102 to S104:
Step S102: multiple servers each obtain a keyword set from a task queue, where the task queue stores multiple keyword sets to be crawled and each keyword set to be crawled contains multiple keywords.
Specifically, a scheduler may group the keywords preset by the user and place the resulting keyword sets in the task queue. Preferably, to ensure that the search engine result pages for keywords of high importance are crawled first, before the multiple servers each obtain a keyword set from the task queue, the method also includes: grouping multiple keywords according to a preset rule to obtain multiple keyword sets; and storing the keyword sets in the task queue according to priority.
Specifically, an embodiment of this application may sort the keywords by importance (for example, keywords that need to be processed first have high importance) and take a preset number of keywords at a time, in sorted order, to form a set. For example, given 300 keywords in total, the 300 keywords are sorted by importance and every 50 keywords in sorted order form one keyword set, which yields 6 keyword sets. These 6 keyword sets are added to the task queue by priority: keyword sets of high importance have high priority, and keyword sets of low importance have low priority.
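The 300-keyword example can be sketched in Python as follows. This is only an illustration under stated assumptions: the function name `group_keywords`, the numeric importance scores, and the `kw0` to `kw299` keywords are hypothetical, and the position of a set in the returned list stands in for its priority in the task queue.

```python
from typing import Dict, List


def group_keywords(importance: Dict[str, int], set_size: int = 50) -> List[List[str]]:
    """Sort keywords by importance (highest first) and cut them into fixed-size sets."""
    ranked = sorted(importance, key=lambda kw: importance[kw], reverse=True)
    return [ranked[i:i + set_size] for i in range(0, len(ranked), set_size)]


# 300 hypothetical keywords; a larger score means higher importance.
scores = {f"kw{i}": 300 - i for i in range(300)}

# Earlier sets hold the more important keywords, so they would be queued first.
keyword_sets = group_keywords(scores)
```

With 300 keywords and a set size of 50, the sketch produces 6 sets, matching the figures in the text.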
The multiple servers of the embodiments of this application each obtain keyword sets from the task queue and carry out the webpage crawling task. As shown in Fig. 2, the scheduler divides all keywords into N sets, namely keyword set 1 to keyword set N, and three servers (server 1, server 2, and server 3) obtain keyword sets from the task queue in turn. For example, server 1 obtains keyword set 1, server 2 obtains keyword set 2, and server 3 obtains keyword set 3. While processing the crawl task for the keyword set it obtained, each server may record the state of the current keyword set, for example its task status (such as crawl succeeded, crawl failed, waiting, or crawling) and its retry count (for example, initially 0 and incremented by 1 each time crawling fails and is retried). After finishing the crawl task for every keyword in the keyword set it obtained, the server obtains a new keyword set from the task queue and processes it, and so on.
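One way to hold the per-set state described above is a small record type. The sketch below is an assumption for illustration: the class names `TaskStatus` and `KeywordSetTask` and the method `mark_failed` do not come from the application, which only lists the statuses and the retry count.

```python
from dataclasses import dataclass
from enum import Enum
from typing import List


class TaskStatus(Enum):
    WAITING = "waiting"
    CRAWLING = "crawling"
    SUCCEEDED = "succeeded"
    FAILED = "failed"


@dataclass
class KeywordSetTask:
    keywords: List[str]
    status: TaskStatus = TaskStatus.WAITING
    retries: int = 0  # starts at 0, as in the example in the text

    def mark_failed(self) -> None:
        """Record a failed crawl and count the retry that will follow."""
        self.status = TaskStatus.FAILED
        self.retries += 1


task = KeywordSetTask(keywords=["seo", "crawler"])
task.status = TaskStatus.CRAWLING
task.mark_failed()
```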
It should be noted that an embodiment of this application may merge identical keywords among the multiple keywords to reduce the total amount to be crawled. An embodiment may also use multiple task queues to match different crawling strategies (for example, different search engines or different crawl limits). For example, task queue 1 stores keywords to be searched on Baidu, and task queue 2 stores keywords to be searched on Sogou.
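Both ideas in this paragraph are easy to sketch in Python. The helper name `merge_duplicates`, the dictionary `queues`, and its keys `"baidu"` and `"sogou"` are hypothetical labels chosen for this illustration; the application does not specify how merging or queue selection is implemented.

```python
from collections import deque
from typing import Deque, Dict, List


def merge_duplicates(keywords: List[str]) -> List[str]:
    """Drop repeated keywords while keeping the first occurrence of each."""
    return list(dict.fromkeys(keywords))


# One task queue per search engine, keyed by a hypothetical engine label.
queues: Dict[str, Deque[List[str]]] = {"baidu": deque(), "sogou": deque()}

merged = merge_duplicates(["seo", "crawler", "seo", "ranking"])
queues["baidu"].append(merged)
queues["sogou"].append(merged)
```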
Optionally, the multiple servers include a first server, and the step in which the multiple servers each obtain a keyword set from the task queue includes the first server obtaining a keyword set from the task queue. This includes: the first server detects whether the task queue contains a keyword set to be crawled; when it detects that such a keyword set exists, the first server locks the task queue, where a locked task queue can be read only by the first server; and the first server obtains a keyword set from the locked task queue and releases the task queue, where a released task queue can be read by any of the multiple servers.
The first server in the embodiments of this application may be any one of the multiple servers. Specifically, the first server first detects whether the task queue contains a keyword set to be crawled. If it detects none, it waits for a preset time and then checks the task queue again. If it detects that a keyword set to be crawled exists, it locks the task queue so that other servers cannot access the queue at that moment, which avoids conflicts caused by multiple threads reading the task queue at the same time. After reading a keyword set from the task queue, the first server releases (unlocks) the task queue so that all servers can access it again.
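The detect, lock, read, and release sequence can be sketched with a single-process stand-in, where `threading.Lock` plays the role of the queue lock and each caller plays the role of a server. The class `LockedTaskQueue`, the function `fetch_with_polling`, and the parameters `wait_seconds` and `max_polls` are assumptions for this sketch; a real deployment across machines would need a distributed lock or a database transaction instead. Note one deliberate deviation: the emptiness check here is done while holding the lock, which closes the gap between detecting a set and locking the queue.

```python
import threading
import time
from collections import deque
from typing import Iterable, List, Optional


class LockedTaskQueue:
    def __init__(self, keyword_sets: Iterable[List[str]]) -> None:
        self._sets = deque(keyword_sets)
        self._lock = threading.Lock()

    def fetch(self) -> Optional[List[str]]:
        """Lock the queue, take one keyword set if any exists, then release."""
        with self._lock:  # only the caller holding the lock can read the queue
            if not self._sets:
                return None
            return self._sets.popleft()


def fetch_with_polling(queue: LockedTaskQueue,
                       wait_seconds: float = 1.0,
                       max_polls: int = 3) -> Optional[List[str]]:
    """Check the queue, waiting a preset time between checks when it is empty."""
    for _ in range(max_polls):
        keyword_set = queue.fetch()
        if keyword_set is not None:
            return keyword_set
        time.sleep(wait_seconds)
    return None


shared_queue = LockedTaskQueue([["a", "b"], ["c"]])
first_set = fetch_with_polling(shared_queue, wait_seconds=0)
second_set = fetch_with_polling(shared_queue, wait_seconds=0)
empty_result = fetch_with_polling(shared_queue, wait_seconds=0, max_polls=2)
```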
Step S104: the multiple servers each crawl, through their respective web crawlers, the search engine result page corresponding to each keyword in the obtained keyword set.
Each of the multiple servers in the embodiments of this application is provided with a web crawler. After the servers have each obtained a keyword set, each server crawls, through its own web crawler, the search engine result page corresponding to each keyword in the obtained keyword set, where the search engine result page corresponding to a keyword is the page of search results displayed after the keyword is entered into a search engine (for example, Baidu or Sogou). Taking Fig. 2 as an example, after server 1 obtains keyword set 1, it crawls, through its web crawler, the search engine result page corresponding to each keyword in keyword set 1. Server 2 and server 3 crawl webpages in the same way as server 1.
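The flow of Fig. 2 can be approximated in one process by letting worker threads stand in for the servers. In this sketch the function `run_servers`, the parameter `server_count`, and the `fetch_page` callback are hypothetical; `fetch_page` is a placeholder for whatever a real crawler does to retrieve a result page, and the standard library `queue.Queue` stands in for the shared task queue.

```python
import queue
import threading
from typing import Callable, Dict, List


def run_servers(keyword_sets: List[List[str]],
                fetch_page: Callable[[str], str],
                server_count: int = 3) -> Dict[str, str]:
    """Let several workers pull keyword sets from one queue until it is empty."""
    tasks: "queue.Queue[List[str]]" = queue.Queue()
    for keyword_set in keyword_sets:
        tasks.put(keyword_set)

    results: Dict[str, str] = {}
    results_lock = threading.Lock()

    def server_loop() -> None:
        while True:
            try:
                keyword_set = tasks.get_nowait()  # actively apply for a task
            except queue.Empty:
                return  # nothing left to crawl
            for keyword in keyword_set:
                page = fetch_page(keyword)
                with results_lock:
                    results[keyword] = page

    workers = [threading.Thread(target=server_loop) for _ in range(server_count)]
    for worker in workers:
        worker.start()
    for worker in workers:
        worker.join()
    return results


# A stub fetcher keeps the example offline; it does not contact any search engine.
pages = run_servers([["a", "b"], ["c"], ["d"]], lambda kw: f"results for {kw}")
```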
In the embodiments of this application, multiple servers process the crawl tasks for the search engine result pages of multiple keywords in a distributed manner. On the one hand, this improves the efficiency of crawling the result pages for multiple keywords. On the other hand, when the number of keywords is very large, it reduces the likelihood of triggering a search engine's anti-crawler policy, which helps ensure that the result pages for all keywords are obtained.
In the embodiments of this application, multiple servers each obtain a keyword set from a task queue, where the task queue stores multiple keyword sets to be crawled and each keyword set to be crawled contains multiple keywords, and the servers each crawl, through their respective web crawlers, the search engine result page corresponding to each keyword in the obtained keyword set. Because the result pages are crawled by multiple servers in a distributed manner, crawling efficiency improves and the likelihood of triggering a search engine's anti-crawler policy falls. This solves the problem in the related art of low efficiency when the web crawler of a single server crawls keyword search engine result pages, and thereby improves the efficiency of crawling the search engine result pages corresponding to keywords.
Preferably, to avoid missing keywords whose crawl failed, the multiple servers include a first server, the keyword set obtained by the first server is a first keyword set, and the web crawler of the first server is a first web crawler. The step in which the multiple servers each crawl, through their respective web crawlers, the search engine result page corresponding to each keyword in the obtained keyword set includes the first server crawling, through the first web crawler, the search engine result page corresponding to each keyword in the first keyword set. This includes: traversing the first keyword set and crawling, through the first web crawler, the search engine result page corresponding to each keyword in it; judging whether the first web crawler succeeded in crawling the search engine result page corresponding to each keyword in the first keyword set; and, when crawling the result page for any keyword in the first keyword set is judged to have failed, adding the keywords that failed to a failure list.
The embodiments of this application are illustrated with the first server as an example. Specifically, the first server traverses each keyword in the first keyword set and crawls the search engine result page corresponding to each keyword through its web crawler (the first web crawler). However, causes such as network exceptions, server exceptions, data parsing exceptions, and triggered anti-crawler measures can make the web crawler fail to crawl the result page for a keyword, that is, fail to obtain the search engine result page corresponding to that keyword. The embodiments of this application therefore check the crawl results of the web crawler. If the web crawler successfully crawls the search engine result page for every keyword in the keyword set, the processing state of that keyword set is recorded as success. If crawling the result page fails for one or more keywords in the keyword set, the keywords that failed are added to the failure list so as to mark them, the processing state of the keyword set is recorded as failure, and the retry count is incremented by 1. By recording the keywords whose crawl failed, the embodiments of this application avoid missing them.
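The traverse, judge, and record steps can be sketched as a single function. The names `crawl_keyword_set`, `fetch_page`, and `stub_fetch` are assumptions for this illustration, and treating any raised exception as a failed crawl is a simplification; the application lists network, server, parsing, and anti-crawler causes without saying how they are detected.

```python
from typing import Callable, Dict, List, Tuple


def crawl_keyword_set(keywords: List[str],
                      fetch_page: Callable[[str], str]
                      ) -> Tuple[Dict[str, str], List[str]]:
    """Crawl every keyword and collect the ones that failed in a failure list."""
    pages: Dict[str, str] = {}
    failure_list: List[str] = []
    for keyword in keywords:
        try:
            pages[keyword] = fetch_page(keyword)
        except Exception:
            # Any error counts as a failed crawl in this sketch.
            failure_list.append(keyword)
    return pages, failure_list


def stub_fetch(keyword: str) -> str:
    """Offline stand-in for a crawler; the keyword 'blocked' simulates a failure."""
    if keyword == "blocked":
        raise RuntimeError("simulated anti-crawler response")
    return f"results for {keyword}"


crawled, failed = crawl_keyword_set(["a", "blocked", "b"], stub_fetch)
set_succeeded = not failed  # the set counts as successful only if nothing failed
```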
It should be noted that when a server fails to crawl the search engine result page corresponding to a keyword, the server may be made to sleep for a preset time and then re-execute the crawl task.
Preferably, after the keywords in the first keyword set that failed to be crawled are added to the failure list, the method also includes: packaging the keywords in the failure list into a new keyword set; and adding the new keyword set to the task queue.
After adding the keywords whose crawl failed to the failure list, the embodiments of this application obtain those keywords from the failure list, repackage them into a new keyword set, and store that set in the task queue so that the failed keywords can be crawled again. This prevents the data for failed keywords from being missed and helps ensure that the search engine result pages for all keywords are crawled.
Preferably, packaging the keywords in the failure list into a new keyword set includes: obtaining the retry count of the keywords in the failure list; judging whether that retry count is less than a preset value; and, when the retry count is judged to be less than the preset value, packaging the keywords in the failure list into a new keyword set.
In practice, the search engine result pages for some keywords may still not be obtained after repeated crawl attempts. To save system resources, the crawl task for these keywords can be stopped, and their search engine result pages can be obtained by other means, for example manually.
Specifically, because this application records the retry count (that is, the number of failures) of each keyword set in advance, the retry count of the keywords in the failure list is obtained and compared with the preset value. If the retry count is less than the preset value, the keywords in the failure list are packaged and stored in the task queue; otherwise, no packaging is performed.
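The retry limit can be sketched as a filter applied before repackaging. This is a sketch under assumptions: the function name `requeue_failures`, the per-keyword `retry_counts` dictionary, and the `max_retries` parameter are hypothetical, and the application records the retry count per keyword set rather than per keyword, so a real implementation might track it at that level instead.

```python
from collections import deque
from typing import Deque, Dict, List


def requeue_failures(failure_list: List[str],
                     retry_counts: Dict[str, int],
                     max_retries: int,
                     task_queue: Deque[List[str]]) -> List[str]:
    """Package failed keywords still under the retry limit into a new set and queue it."""
    retryable = [kw for kw in failure_list if retry_counts.get(kw, 0) < max_retries]
    if retryable:
        task_queue.append(retryable)  # the new keyword set goes back on the queue
    return retryable


pending: Deque[List[str]] = deque()
requeued = requeue_failures(
    failure_list=["a", "b", "c"],
    retry_counts={"a": 1, "b": 3, "c": 0},
    max_retries=3,
    task_queue=pending,
)
```

Keywords that have reached the limit (here `"b"`) are left out of the new set, which corresponds to stopping their crawl task and handling them by other means.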
As can be seen from the above description, for a large number of keywords, the embodiments of this application break the whole into parts and distribute the parts to machines (servers) with different IP addresses, which achieves distributed crawling and also reduces the likelihood of triggering anti-crawler measures. When crawling fails (usually because an anti-crawler mechanism has been triggered), the keywords are regrouped and crawling is attempted again, which ensures that data is crawled for every keyword and none is missed. The keywords are grouped and added to a crawl queue, which may take the form of a distributed queue component or of a database. The crawlers actively apply for tasks, so the crawl rate can be controlled according to actual conditions. Identical keywords can be merged to reduce the total amount to be crawled.
It should be noted that the steps shown in the flowchart of the drawings may be executed in a computer system, such as one running a set of computer-executable instructions, and that, although a logical order is shown in the flowchart, in some cases the steps shown or described may be executed in an order different from the one given here.
According to another aspect of the embodiments of this application, a webpage crawling apparatus is provided. The apparatus may be used to perform the webpage crawling method of the embodiments of this application, and the method may likewise be performed by the apparatus.
Fig. 3 is a schematic diagram of the webpage crawling apparatus according to an embodiment of this application. As shown in Fig. 3, the apparatus includes an acquiring unit 10 and a crawling unit 20.
The acquiring unit 10 is configured to make multiple servers each obtain a keyword set from a task queue, where the task queue stores multiple keyword sets to be crawled and each keyword set to be crawled contains multiple keywords.
Optionally, the multiple servers include a first server, and the acquiring unit 10 includes: a detection module, configured to make the first server detect whether the task queue contains a keyword set to be crawled; a locking module, configured to make the first server lock the task queue when it detects that such a keyword set exists, where a locked task queue can be read only by the first server; and an acquisition module, configured to make the first server obtain a keyword set from the locked task queue and release the task queue, where a released task queue can be read by any of the multiple servers.
The crawling unit 20 is configured to make the multiple servers each crawl, through their respective web crawlers, the search engine result page corresponding to each keyword in the obtained keyword set.
Optionally, the multiple servers include a first server, the keyword set obtained by the first server is a first keyword set, and the web crawler of the first server is a first web crawler. The crawling unit 20 includes: a crawling module, configured to traverse the first keyword set and crawl, through the first web crawler, the search engine result page corresponding to each keyword in it; a judging module, configured to judge whether the first web crawler succeeded in crawling the search engine result page corresponding to each keyword in the first keyword set; and an adding module, configured to add the keywords that failed to a failure list when crawling the result page for any keyword in the first keyword set is judged to have failed.
In the embodiments of this application, the acquiring unit 10 makes multiple servers each obtain a keyword set from a task queue, where the task queue stores multiple keyword sets to be crawled and each keyword set to be crawled contains multiple keywords, and the crawling unit 20 makes the servers each crawl, through their respective web crawlers, the search engine result page corresponding to each keyword in the obtained keyword set. Because the result pages are crawled by multiple servers in a distributed manner, crawling efficiency improves and the likelihood of triggering a search engine's anti-crawler policy falls. This solves the problem in the related art of low efficiency when the web crawler of a single server crawls keyword search engine result pages, and thereby improves the efficiency of crawling the search engine result pages corresponding to keywords.
Preferably, the apparatus also includes: a packaging unit, configured to package the keywords in the failure list into a new keyword set; and an adding unit, configured to add the new keyword set to the task queue.
The webpage crawling apparatus includes a processor and a memory. The acquiring unit, the crawling unit, and the other units described above are stored in the memory as program units, and the processor executes these program units stored in the memory to implement the corresponding functions.
The processor contains a kernel, and the kernel retrieves the corresponding program unit from the memory. One or more kernels may be provided, and the search engine result pages corresponding to the keywords are crawled by adjusting kernel parameters.
The memory may include computer-readable media in the form of volatile memory, such as random access memory (RAM), and/or non-volatile memory, such as read-only memory (ROM) or flash memory (flash RAM); the memory includes at least one memory chip.
The present application also provides a computer program product which, when executed on a data processing device, is adapted to execute program code initialized with the following method steps: multiple servers each obtain a keyword group from a task queue, where the task queue stores multiple keyword groups to be crawled and each keyword group to be crawled includes multiple keywords; and the multiple servers crawl, through their respective web crawlers, the search engine results page corresponding to each keyword in the obtained keyword group.
The serial numbers of the above embodiments of the present application are for description only and do not indicate the relative merits of the embodiments.
In the above embodiments of the present application, the description of each embodiment has its own emphasis; for parts not described in detail in one embodiment, refer to the related descriptions of the other embodiments.
In the several embodiments provided in the present application, it should be understood that the disclosed technical content may be implemented in other ways. The apparatus embodiments described above are merely illustrative. For example, the division into units may be a division by logical function, and other divisions are possible in actual implementation: multiple units or components may be combined or integrated into another system, or some features may be omitted or not performed. In addition, the mutual coupling, direct coupling, or communication connection shown or discussed may be an indirect coupling or communication connection through some interfaces, units, or modules, and may be electrical or take other forms.
The units described as separate components may or may not be physically separate, and the components shown as units may or may not be physical units; that is, they may be located in one place or distributed across multiple units. Some or all of the units may be selected according to actual needs to achieve the purpose of the solution of this embodiment.
In addition, the functional units in the embodiments of the present application may be integrated into one processing unit, each unit may exist physically on its own, or two or more units may be integrated into one unit. The integrated unit may be implemented in the form of hardware or in the form of a software functional unit.
If the integrated unit is implemented in the form of a software functional unit and sold or used as an independent product, it may be stored in a computer-readable storage medium. Based on this understanding, the technical solution of the present application in essence, or the part that contributes to the prior art, or all or part of the technical solution, may be embodied in the form of a software product. The computer software product is stored in a storage medium and includes several instructions that cause a computer device (which may be a personal computer, a server, a network device, or the like) to perform all or some of the steps of the methods described in the embodiments of the present application. The aforementioned storage medium includes various media that can store program code, such as a USB flash drive, read-only memory (ROM, Read-Only Memory), random access memory (RAM, Random Access Memory), a portable hard disk, a magnetic disk, or an optical disc.
The above are only preferred embodiments of the present application. It should be noted that those of ordinary skill in the art may make several improvements and refinements without departing from the principles of the present application, and these improvements and refinements shall also be regarded as falling within the protection scope of the present application.
Claims (10)
1. A webpage crawling method, characterized by comprising:
multiple servers each obtaining a keyword group from a task queue, wherein the task queue stores multiple keyword groups to be crawled, and each keyword group to be crawled includes multiple keywords; and
the multiple servers crawling, through their respective web crawlers, the search engine results page corresponding to each keyword in the obtained keyword group.
2. The method according to claim 1, characterized in that the multiple servers include a first server, the multiple servers each obtaining a keyword group from the task queue includes the first server obtaining a keyword group from the task queue, and the first server obtaining a keyword group from the task queue comprises:
the first server detecting whether a keyword group to be crawled exists in the task queue;
the first server locking the task queue upon detecting that a keyword group to be crawled exists in the task queue, wherein the locked task queue can be read only by the first server; and
the first server obtaining the keyword group from the locked task queue and releasing the task queue, wherein the released task queue can be read by any one of the multiple servers.
3. The method according to claim 1, characterized in that the multiple servers include a first server, the keyword group obtained by the first server is a first keyword group, the web crawler of the first server is a first web crawler, the multiple servers crawling, through their respective web crawlers, the search engine results page corresponding to each keyword in the obtained keyword group includes the first server crawling, through the first web crawler, the search engine results page corresponding to each keyword in the first keyword group, and the first server crawling, through the first web crawler, the search engine results page corresponding to each keyword in the first keyword group comprises:
traversing the first keyword group, and crawling, through the first web crawler, the search engine results page corresponding to each keyword in the first keyword group;
judging whether the first web crawler successfully crawled the search engine results page corresponding to each keyword in the first keyword group; and
when it is judged that crawling the search engine results page corresponding to a keyword in the first keyword group has failed, adding the keywords in the first keyword group whose crawl failed to a failure list.
4. The method according to claim 3, characterized in that, after the keywords in the first keyword group whose crawl failed are added to the failure list, the method further comprises:
packaging the keywords in the failure list into a new keyword group; and
adding the new keyword group to the task queue.
5. The method according to claim 4, characterized in that packaging the keywords in the failure list into a new keyword group comprises:
obtaining the retry count of the keywords in the failure list;
judging whether the retry count of the keywords in the failure list is less than a preset value; and
when it is judged that the retry count of the keywords in the failure list is less than the preset value, packaging the keywords in the failure list into a new keyword group.
6. The method according to claim 1, characterized in that, before the multiple servers each obtain a keyword group from the task queue, the method further comprises:
grouping multiple keywords according to a preset rule to obtain multiple keyword groups; and
storing the multiple keyword groups in the task queue by priority.
7. A webpage crawling apparatus, characterized by comprising:
an acquiring unit, configured to cause multiple servers to each obtain a keyword group from a task queue, wherein the task queue stores multiple keyword groups to be crawled, and each keyword group to be crawled includes multiple keywords; and
a crawling unit, configured to cause the multiple servers to crawl, through their respective web crawlers, the search engine results page corresponding to each keyword in the obtained keyword group.
8. The apparatus according to claim 7, characterized in that the multiple servers include a first server, and the acquiring unit comprises:
a detecting module, configured to cause the first server to detect whether a keyword group to be crawled exists in the task queue;
a locking module, configured to cause the first server to lock the task queue upon detecting that a keyword group to be crawled exists in the task queue, wherein the locked task queue can be read only by the first server; and
an acquiring module, configured to cause the first server to obtain the keyword group from the locked task queue and release the task queue, wherein the released task queue can be read by any one of the multiple servers.
9. The apparatus according to claim 7, characterized in that the multiple servers include a first server, the keyword group obtained by the first server is a first keyword group, the web crawler of the first server is a first web crawler, and the crawling unit comprises:
a crawling module, configured to traverse the first keyword group and crawl, through the first web crawler, the search engine results page corresponding to each keyword in the first keyword group;
a judging module, configured to judge whether the first web crawler successfully crawled the search engine results page corresponding to each keyword in the first keyword group; and
an adding module, configured to add, when it is judged that crawling the search engine results page corresponding to a keyword in the first keyword group has failed, the keywords in the first keyword group whose crawl failed to a failure list.
10. The apparatus according to claim 9, characterized in that the apparatus further comprises:
a packaging unit, configured to package the keywords in the failure list into a new keyword group; and
an adding unit, configured to add the new keyword group to the task queue.
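As an illustration of the queue handling recited in claims 2 and 6, the sketch below groups keywords, stores the groups by priority, and lets a caller take one group while holding a lock. Several choices are assumptions rather than anything the claims specify: fixed-size chunking stands in for the unspecified preset rule, a lower number means higher priority, and an in-process `threading.Lock` stands in for whatever cross-server locking a real deployment would use. The emptiness check is also done inside the lock to avoid a race, which reorders the detect-then-lock steps of claim 2.

```python
import heapq
import threading
from typing import List, Optional, Tuple


class KeywordTaskQueue:
    """Priority-ordered task queue whose reads are serialized by a lock."""

    def __init__(self) -> None:
        self._heap: List[Tuple[int, int, List[str]]] = []
        self._lock = threading.Lock()
        self._counter = 0  # tie-breaker: keeps insertion order within a priority

    def put(self, priority: int, keyword_group: List[str]) -> None:
        with self._lock:
            heapq.heappush(self._heap, (priority, self._counter, keyword_group))
            self._counter += 1

    def acquire_group(self) -> Optional[List[str]]:
        """Lock the queue, take the highest-priority group, then release the lock."""
        with self._lock:  # while held, only this caller can read the queue
            if not self._heap:  # detect whether any group is waiting
                return None
            _, _, keyword_group = heapq.heappop(self._heap)
            return keyword_group  # the lock is released when the block exits


def group_keywords(keywords: List[str], group_size: int) -> List[List[str]]:
    """Assumed grouping rule: split the keyword list into fixed-size chunks."""
    return [keywords[i:i + group_size] for i in range(0, len(keywords), group_size)]


tq = KeywordTaskQueue()
groups = group_keywords(["k1", "k2", "k3", "k4", "k5"], group_size=2)
for priority, keyword_group in enumerate(groups):
    tq.put(priority, keyword_group)

first = tq.acquire_group()
print(first)
```

Calling `acquire_group()` repeatedly returns the remaining groups in priority order and then `None` once the queue is empty.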
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201510729544.6A CN106649362B (en) | 2015-10-30 | 2015-10-30 | Webpage crawling method and device |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201510729544.6A CN106649362B (en) | 2015-10-30 | 2015-10-30 | Webpage crawling method and device |
Publications (2)
Publication Number | Publication Date |
---|---|
CN106649362A (en) | 2017-05-10 |
CN106649362B (en) | 2020-02-07 |
Family ID: 58809462
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201510729544.6A Active CN106649362B (en) | 2015-10-30 | 2015-10-30 | Webpage crawling method and device |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN106649362B (en) |
Citations (7)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN101334784A (en) * | 2008-07-30 | 2008-12-31 | 施章祖 | Computer auxiliary report and knowledge base generation method |
CN101788988A (en) * | 2009-01-22 | 2010-07-28 | 蔡亮华 | Information extraction method |
CN101916291A (en) * | 2010-08-26 | 2010-12-15 | 北京大学 | Method for crawling eDonkey network shared file and client information |
CN103605764A (en) * | 2013-11-26 | 2014-02-26 | Tcl集团股份有限公司 | Web crawler system and web crawler multitask executing and scheduling method |
CN104077402A (en) * | 2014-07-04 | 2014-10-01 | 用友软件股份有限公司 | Data processing method and data processing system |
CN104199830A (en) * | 2014-07-31 | 2014-12-10 | 渠成 | Search engine optimization big data management platform |
WO2015039165A1 (en) * | 2013-09-19 | 2015-03-26 | Longtail Ux Pty Ltd | Improvements in website traffic optimization |
Cited By (15)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN107133217A (en) * | 2017-05-26 | 2017-09-05 | 北京惠商之星网络科技有限公司 | Target topic intelligent grabbing method, system and computer-readable recording medium |
CN110020041A (en) * | 2017-08-21 | 2019-07-16 | 北京国双科技有限公司 | A kind of method and device tracking the process that crawls |
CN110147473B (en) * | 2017-08-28 | 2022-03-01 | 北京国双科技有限公司 | Crawling method and device for crawler |
CN110147473A (en) * | 2017-08-28 | 2019-08-20 | 北京国双科技有限公司 | A kind of crawling method and device of crawler |
CN108170551A (en) * | 2018-01-03 | 2018-06-15 | 深圳壹账通智能科技有限公司 | Front and back end error handling method, server and storage medium based on crawler system |
CN109657462A (en) * | 2018-12-06 | 2019-04-19 | 江苏满运软件科技有限公司 | Data detection method, system, electronic equipment and storage medium |
CN109815380A (en) * | 2018-12-20 | 2019-05-28 | 山东中创软件工程股份有限公司 | A kind of information crawler method, apparatus, equipment and computer readable storage medium |
CN110287444A (en) * | 2019-07-02 | 2019-09-27 | 郑州悉知信息科技股份有限公司 | Website detection method, device and storage medium |
CN110287444B (en) * | 2019-07-02 | 2021-06-25 | 郑州悉知信息科技股份有限公司 | Website detection method and device and storage medium |
CN110928711A (en) * | 2019-11-26 | 2020-03-27 | 多点(深圳)数字科技有限公司 | Task processing method, device, system, server and storage medium |
US11941073B2 (en) | 2019-12-23 | 2024-03-26 | 97th Floor | Generating and implementing keyword clusters |
CN111460254A (en) * | 2020-03-24 | 2020-07-28 | 南阳柯丽尔科技有限公司 | Webpage crawling method, device, storage medium and equipment based on multithreading |
CN111460254B (en) * | 2020-03-24 | 2023-05-05 | 南阳柯丽尔科技有限公司 | Webpage crawling method and device based on multithreading, storage medium and equipment |
CN113239253A (en) * | 2021-04-09 | 2021-08-10 | 北京皮尔布莱尼软件有限公司 | Web crawler implementation method, system, computing device and storage medium |
CN113239253B (en) * | 2021-04-09 | 2024-02-23 | 北京皮尔布莱尼软件有限公司 | Method, system, computing device and storage medium for realizing web crawler |
Also Published As
Publication number | Publication date |
---|---|
CN106649362B (en) | 2020-02-07 |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
CB02 | Change of applicant information | Address after: 100083 No. 401, 4th Floor, Haitai Building, 229 North Fourth Ring Road, Haidian District, Beijing; Applicant after: Beijing Guoshuang Technology Co.,Ltd.; Address before: 100086 Cuigong Hotel, 76 Zhichun Road, Shuangyushu District, Haidian District, Beijing; Applicant before: Beijing Guoshuang Technology Co.,Ltd. |
GR01 | Patent grant | ||