CN106649362B - Webpage crawling method and device - Google Patents

Webpage crawling method and device

Info

Publication number
CN106649362B
Authority
CN
China
Prior art keywords
keyword
task queue
keyword group
server
search engine
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201510729544.6A
Other languages
Chinese (zh)
Other versions
CN106649362A (en)
Inventor
何熠皓
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Gridsum Technology Co Ltd
Original Assignee
Beijing Gridsum Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Gridsum Technology Co Ltd
Priority to CN201510729544.6A
Publication of CN106649362A
Application granted
Publication of CN106649362B
Legal status: Active
Anticipated expiration

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00: Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90: Details of database functions independent of the retrieved data types
    • G06F16/95: Retrieval from the web
    • G06F16/951: Indexing; Web crawling techniques

Landscapes

  • Engineering & Computer Science (AREA)
  • Databases & Information Systems (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The application discloses a webpage crawling method and device. The method comprises the following steps: a plurality of servers respectively obtain keyword groups from a task queue, wherein a plurality of keyword groups to be crawled are stored in the task queue, and each keyword group to be crawled comprises a plurality of keywords; and the plurality of servers respectively crawl, through their respective web crawlers, the search engine result page corresponding to each keyword in the acquired keyword groups. The method and the device solve the technical problem in the prior art that crawling keyword search engine result pages with the web crawler of a single server is inefficient.

Description

Webpage crawling method and device
Technical Field
The application relates to the field of internet, in particular to a webpage crawling method and device.
Background
In a conventional Search Engine Optimization (SEO) service, it is generally necessary to help a user analyze the ranking of keywords in a search engine. Typically, the user presets a set of keywords, and a web crawler periodically crawls the ranking of those keywords in the search engine. That is, the web crawler crawls the search engine result page corresponding to each keyword, where the search engine result page corresponding to a keyword is the result page displayed after the keyword is entered into a search engine (e.g., Baidu or Sogou).
However, in order to block robots (e.g., web crawlers) or to reduce abnormal access traffic, a search engine often limits the search speed or search frequency of a single IP address (i.e., an anti-crawler strategy). In addition, the keywords specified by users generally accumulate to a very large number. Crawling keyword search engine result pages with only a single machine or a single IP address is therefore inefficient, and the anti-crawler strategy of the search engine also makes it difficult to crawl the search engine result pages corresponding to all of the keywords.
No effective solution has yet been proposed for the problem in the prior art that crawling the search engine result pages corresponding to keywords with the web crawler of a single server is inefficient.
Disclosure of Invention
The application mainly aims to provide a webpage crawling method and device to solve the problem that in the related art, when a web crawler of a single server crawls a keyword search engine result page, efficiency is low.
In order to achieve the above object, according to one aspect of the present application, there is provided a web page crawling method. The method comprises the following steps: a plurality of servers respectively obtain keyword groups from a task queue, wherein a plurality of keyword groups to be crawled are stored in the task queue, and each keyword group to be crawled comprises a plurality of keywords; and the plurality of servers respectively crawl, through their respective web crawlers, the search engine result page corresponding to each keyword in the acquired keyword groups.
Further, the plurality of servers includes a first server, the plurality of servers respectively obtaining keyword groups from the task queue includes the first server obtaining a keyword group from the task queue, and the first server obtaining a keyword group from the task queue includes: the first server detects whether a keyword group to be crawled exists in the task queue; when the first server detects that a keyword group to be crawled exists in the task queue, the first server locks the task queue, wherein the locked task queue can only be read by the first server; and the first server acquires the keyword group from the locked task queue and releases the task queue, wherein the released task queue can be read by any one of the plurality of servers.
Further, the plurality of servers includes a first server, the keyword group acquired by the first server is a first keyword group, and the web crawler of the first server is a first web crawler. The plurality of servers respectively crawling, through their respective web crawlers, the search engine result page corresponding to each keyword in the acquired keyword groups includes the first server crawling, through the first web crawler, the search engine result page corresponding to each keyword in the first keyword group, which in turn includes: traversing the first keyword group, and crawling the search engine result page corresponding to each keyword in the first keyword group through the first web crawler; judging whether the first web crawler successfully crawled the search engine result page corresponding to each keyword in the first keyword group; and when it is judged that crawling the search engine result page corresponding to a keyword in the first keyword group has failed, adding the keyword that failed to be crawled to a failure list.
Further, after the keyword in the first keyword group that failed to be crawled is added to the failure list, the method further comprises: packaging the keywords in the failure list into a new keyword group; and adding the new keyword group to the task queue.
Further, packaging the keywords in the failure list into a new keyword group includes: obtaining the number of retries of the keywords in the failure list; judging whether the number of retries of the keywords in the failure list is smaller than a preset value; and when the number of retries of the keywords in the failure list is judged to be smaller than the preset value, packaging the keywords in the failure list into a new keyword group.
Further, before the plurality of servers respectively obtain keyword groups from the task queue, the method further comprises: grouping a plurality of keywords according to a preset rule to obtain a plurality of keyword groups; and storing the plurality of keyword groups in the task queue according to priority.
In order to achieve the above object, according to another aspect of the present application, there is provided a web page crawling apparatus. The apparatus includes: an acquisition unit, configured to enable a plurality of servers to respectively obtain keyword groups from a task queue, wherein a plurality of keyword groups to be crawled are stored in the task queue, and each keyword group to be crawled comprises a plurality of keywords; and a crawling unit, configured to enable the plurality of servers to respectively crawl, through their respective web crawlers, the search engine result page corresponding to each keyword in the acquired keyword groups.
Further, the plurality of servers includes a first server, and the acquisition unit includes: a detection module, configured to enable the first server to detect whether a keyword group to be crawled exists in the task queue; a locking module, configured to lock the task queue when the first server detects that a keyword group to be crawled exists in the task queue, wherein the locked task queue can only be read by the first server; and an acquisition module, configured to enable the first server to acquire the keyword group from the locked task queue and release the task queue, wherein the released task queue can be read by any one of the plurality of servers.
Further, the plurality of servers includes a first server, the keyword group acquired by the first server is a first keyword group, the web crawler of the first server is a first web crawler, and the crawling unit includes: a crawling module, configured to traverse the first keyword group and crawl the search engine result page corresponding to each keyword in the first keyword group through the first web crawler; a judgment module, configured to judge whether the first web crawler successfully crawled the search engine result page corresponding to each keyword in the first keyword group; and an adding module, configured to add the keyword that failed to be crawled in the first keyword group to a failure list when it is judged that crawling the search engine result page corresponding to a keyword in the first keyword group has failed.
Further, the apparatus further comprises: a packaging unit, configured to package the keywords in the failure list into a new keyword group; and an adding unit, configured to add the new keyword group to the task queue.
In the present application, a plurality of servers respectively obtain keyword groups from a task queue, wherein a plurality of keyword groups to be crawled are stored in the task queue, and each keyword group to be crawled comprises a plurality of keywords; and the plurality of servers respectively crawl, through their respective web crawlers, the search engine result page corresponding to each keyword in the acquired keyword groups. Because the search engine result pages corresponding to the keywords are crawled in a distributed manner by a plurality of servers, the efficiency of crawling these pages can be improved and the possibility of triggering the anti-crawler strategy of a search engine can be reduced. This solves the problem in the related art that crawling keyword search engine result pages with the web crawler of a single server is inefficient.
Drawings
The accompanying drawings, which are incorporated in and constitute a part of this application, illustrate embodiments of the application and, together with the description, serve to explain the application and are not intended to limit the application. In the drawings:
FIG. 1 is a flow chart of a web page crawling method according to an embodiment of the present application;
FIG. 2 is a schematic diagram of a distributed crawl of web pages according to an embodiment of the present application; and
FIG. 3 is a schematic diagram of a web page crawling apparatus according to an embodiment of the present application.
Detailed Description
It should be noted that the embodiments and features of the embodiments in the present application may be combined with each other without conflict. The present application will be described in detail below with reference to the embodiments with reference to the attached drawings.
In order to make the technical solutions better understood by those skilled in the art, the technical solutions in the embodiments of the present application will be clearly and completely described below with reference to the drawings in the embodiments of the present application, and it is obvious that the described embodiments are only partial embodiments of the present application, but not all embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present application.
It should be noted that the terms "first," "second," and the like in the description and claims of this application and in the drawings described above are used for distinguishing between similar elements and not necessarily for describing a particular sequential or chronological order. It should be understood that the data so used may be interchanged under appropriate circumstances such that embodiments of the application described herein may be used. Furthermore, the terms "comprises," "comprising," and "having," and any variations thereof, are intended to cover a non-exclusive inclusion, such that a process, method, system, article, or apparatus that comprises a list of steps or elements is not necessarily limited to those steps or elements expressly listed, but may include other steps or elements not expressly listed or inherent to such process, method, article, or apparatus.
For ease of description, some concepts related to the present application are illustrated below:
Search engine optimization (SEO): a way of using the search rules of a search engine to improve the ranking of a target website within that search engine.
Queue: a special linear table that only allows deletion at the front of the table and insertion at the rear of the table; it is also called a First-In First-Out linear table, or FIFO table for short.
Web crawler: also known as a web spider or web robot; a program or script that automatically captures web information according to preset rules.
According to the embodiment of the application, a webpage crawling method is provided. Fig. 1 is a flowchart of a web page crawling method according to an embodiment of the present application, and as shown in fig. 1, the method includes steps S102 to S104 as follows:
Step S102: a plurality of servers respectively obtain keyword groups from a task queue, wherein a plurality of keyword groups to be crawled are stored in the task queue, and each keyword group to be crawled comprises a plurality of keywords.
Specifically, a plurality of keywords preset by a user may be grouped by a scheduler, and the keyword groups obtained after grouping are placed in the task queue. Preferably, in order to ensure that the search engine result pages corresponding to keywords of high importance are crawled first, before the plurality of servers respectively obtain keyword groups from the task queue, the method further includes: grouping the plurality of keywords according to a preset rule to obtain a plurality of keyword groups; and storing the plurality of keyword groups in the task queue according to priority.
Specifically, in the embodiment of the present application, the keywords may be sorted by importance (for example, a keyword that needs to be processed first has high importance), and a preset number of keywords are then selected in sorted order to form each group. For example, given 300 keywords in total, the 300 keywords are sorted by importance and every 50 consecutive keywords in that order form one keyword group, which yields 6 keyword groups. The 6 keyword groups are added to the task queue according to priority, where a keyword group of high importance has high priority and a keyword group of low importance has low priority.
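The grouping in the 300-keyword example can be sketched as follows. This is a minimal illustration rather than the implementation of the application: the function name `group_keywords`, the numeric importance scores and the keyword names are all assumptions made for the example.

```python
from typing import Dict, List


def group_keywords(importance: Dict[str, int], group_size: int) -> List[List[str]]:
    """Sort keywords by descending importance and cut them into fixed-size groups."""
    ranked = sorted(importance, key=lambda kw: importance[kw], reverse=True)
    return [ranked[i:i + group_size] for i in range(0, len(ranked), group_size)]


# 300 hypothetical keywords; a larger score means a more important keyword.
scores = {f"kw{i:03d}": 300 - i for i in range(300)}
groups = group_keywords(scores, group_size=50)

# groups[0] holds the 50 most important keywords and would be enqueued first.
print(len(groups), len(groups[0]), groups[0][0])  # prints: 6 50 kw000
```

Because the groups are produced in descending order of importance, appending them to a first-in first-out queue in list order is one simple way to realise the priority described above.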
In the embodiment of the present application, the servers respectively obtain keyword groups from the task queue in order to perform the web page crawling task. As shown in fig. 2, the scheduler divides all keywords into N groups, namely keyword group 1 to keyword group N, and three servers (server 1, server 2 and server 3) obtain keyword groups from the task queue in turn; for example, server 1 obtains keyword group 1, server 2 obtains keyword group 2, and server 3 obtains keyword group 3. While processing the crawling task of an obtained keyword group, each server records the current state of that keyword group, for example its task status (e.g., crawling succeeded, crawling failed, waiting, crawling in progress) and its number of retries (e.g., the initial value may be set to 0 and increased by 1 each time crawling fails). When a server has finished the crawling task for every keyword in the obtained keyword group, it obtains a new keyword group from the task queue for processing, and so on.
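The pull-based loop described above, in which each server keeps fetching a new keyword group until the queue is empty, can be sketched with threads standing in for servers. This is only an assumption-laden illustration: the names `worker` and `processed` are hypothetical, a real deployment would use separate machines with a shared queue component or database, and the actual crawling is replaced by a placeholder.

```python
import queue
import threading
from typing import Dict, List

# Six hypothetical keyword groups of three keywords each.
task_queue = queue.Queue()
for n in range(1, 7):
    task_queue.put([f"group{n}-kw{i}" for i in range(3)])

processed: Dict[str, List[str]] = {}
processed_lock = threading.Lock()


def worker(server_name: str) -> None:
    """Keep taking keyword groups until the task queue is empty."""
    while True:
        try:
            group = task_queue.get_nowait()
        except queue.Empty:
            return
        # Placeholder for crawling each keyword in the group.
        with processed_lock:
            processed.setdefault(server_name, []).extend(group)


threads = [threading.Thread(target=worker, args=(f"server{i}",)) for i in (1, 2, 3)]
for t in threads:
    t.start()
for t in threads:
    t.join()

total = sum(len(keywords) for keywords in processed.values())
```

Which server ends up with which group depends on scheduling, but every group is handed to exactly one server, so `total` equals the number of keywords enqueued.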
It should be noted that, in the embodiment of the present application, identical keywords among the plurality of keywords may be merged to reduce the total amount of crawling. A plurality of task queues may also be used to match different crawling strategies (e.g., different search engines, crawling limits, etc.); for example, task queue 1 stores keywords to be searched on Baidu, and task queue 2 stores keywords to be searched on Sogou.
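Merging identical keywords and keeping one task queue per search engine could look like the sketch below. The helper `merge_duplicates`, the queue keys `"baidu"` and `"sogou"`, and the sample keywords are illustrative assumptions, not names taken from the application.

```python
from collections import deque
from typing import Deque, Dict, Iterable, List


def merge_duplicates(keywords: Iterable[str]) -> List[str]:
    """Drop repeated keywords while keeping the first-seen order."""
    return list(dict.fromkeys(keywords))


# One FIFO task queue per search engine, so each can follow its own crawl limits.
queues: Dict[str, Deque[List[str]]] = {"baidu": deque(), "sogou": deque()}

unique = merge_duplicates(["seo", "crawler", "seo", "queue"])
queues["baidu"].append(unique)
queues["sogou"].append(list(unique))
```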
Optionally, the plurality of servers includes a first server, the plurality of servers respectively obtaining keyword groups from the task queue includes the first server obtaining a keyword group from the task queue, and the first server obtaining a keyword group from the task queue includes: the first server detects whether a keyword group to be crawled exists in the task queue; when the first server detects that a keyword group to be crawled exists in the task queue, the first server locks the task queue, wherein the locked task queue can only be read by the first server; and the first server acquires the keyword group from the locked task queue and releases the task queue, wherein the released task queue can be read by any one of the plurality of servers.
The first server in the embodiment of the present application may be any one of the plurality of servers. Specifically, the first server detects whether a keyword group to be crawled exists in the task queue. When no keyword group to be crawled exists in the task queue, the first server waits for a preset time and then checks again. When a keyword group to be crawled exists in the task queue, the first server locks the task queue so that other servers cannot access it at that moment, which avoids conflicts when multiple threads read the task queue at the same time. After the first server has read the keyword group from the task queue, it releases (i.e., unlocks) the task queue so that all servers can access it again.
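The detect, lock, read and release sequence can be sketched in a single process as follows. The class `TaskQueue`, the function `acquire_group` and the parameters `wait_seconds` and `max_polls` are hypothetical; a `threading.Lock` stands in for whatever locking the shared queue component or database would provide across servers. In this sketch the emptiness check is performed while the lock is held, which is a small departure from the order described above, chosen so that two readers cannot both see the same last group.

```python
import threading
import time
from collections import deque
from typing import Deque, List, Optional


class TaskQueue:
    """In-process stand-in for the shared task queue (an assumption)."""

    def __init__(self) -> None:
        self._groups: Deque[List[str]] = deque()
        self._lock = threading.Lock()

    def put(self, group: List[str]) -> None:
        with self._lock:
            self._groups.append(group)

    def try_get(self) -> Optional[List[str]]:
        """Lock the queue, take the oldest group if there is one, then release."""
        with self._lock:
            if self._groups:
                return self._groups.popleft()
            return None


def acquire_group(task_queue: TaskQueue, wait_seconds: float,
                  max_polls: int) -> Optional[List[str]]:
    """Poll for a keyword group, sleeping for a preset time after each empty check."""
    for _ in range(max_polls):
        group = task_queue.try_get()
        if group is not None:
            return group
        time.sleep(wait_seconds)
    return None
```

A caller would typically loop on `acquire_group` with a wait time suited to how often new keyword groups are scheduled; the `max_polls` bound exists only so that the sketch terminates.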
Step S104: the plurality of servers respectively crawl, through their respective web crawlers, the search engine result page corresponding to each keyword in the acquired keyword groups.
In the embodiment of the application, each of the plurality of servers is provided with a web crawler. After the servers have respectively acquired their keyword groups, each server crawls, through its own web crawler, the search engine result page corresponding to each keyword in the keyword group it acquired, where the search engine result page corresponding to a keyword is the result page displayed after the keyword is entered into a search engine (e.g., Baidu or Sogou). Taking fig. 2 as an example, after server 1 acquires keyword group 1, server 1 crawls the search engine result page corresponding to each keyword in keyword group 1 through the web crawler of server 1; server 2 and server 3 crawl web pages in the same way as server 1.
According to the embodiment of the application, having the plurality of servers process the crawling tasks in a distributed manner improves the efficiency of crawling the search engine result pages corresponding to the keywords. In addition, when the number of keywords is very large, it reduces the possibility of triggering the anti-crawler strategy of the search engine, so that the search engine result pages corresponding to all of the keywords can be obtained.
In summary, a plurality of servers respectively obtain keyword groups from a task queue, wherein the task queue stores a plurality of keyword groups to be crawled, and each keyword group to be crawled comprises a plurality of keywords; and the plurality of servers respectively crawl, through their respective web crawlers, the search engine result page corresponding to each keyword in the acquired keyword groups. This distributed crawling solves the problem in the related art that crawling keyword search engine result pages with the web crawler of a single server is inefficient.
Preferably, in order to avoid missing keywords that failed to be crawled, the plurality of servers includes a first server, the keyword group acquired by the first server is a first keyword group, and the web crawler of the first server is a first web crawler. The plurality of servers respectively crawling, through their respective web crawlers, the search engine result page corresponding to each keyword in the acquired keyword groups includes the first server crawling, through the first web crawler, the search engine result page corresponding to each keyword in the first keyword group, which in turn includes: traversing the first keyword group, and crawling the search engine result page corresponding to each keyword in the first keyword group through the first web crawler; judging whether the first web crawler successfully crawled the search engine result page corresponding to each keyword in the first keyword group; and when it is judged that crawling the search engine result page corresponding to a keyword in the first keyword group has failed, adding the keyword that failed to be crawled to a failure list.
Taking the first server as an example, the first server traverses each keyword in the first keyword group and crawls the search engine result page corresponding to each keyword through its web crawler (i.e., the first web crawler). However, the web crawler may fail to crawl the search engine result page corresponding to a keyword, that is, fail to acquire the page, because of a network abnormality, a server abnormality, a data parsing abnormality, a triggered anti-crawler mechanism, and the like. Therefore, in the embodiment of the application, the crawling result of the web crawler is checked. If the web crawler successfully crawled the search engine result page corresponding to every keyword in the keyword group, the processing state of the keyword group is recorded as success. If crawling failed for one or more keywords in the keyword group, each keyword that failed is added to the failure list so that the failed keywords can be identified, the processing state of the keyword group is recorded as failure, and the number of retries is increased by 1. Recording the keywords that failed to be crawled prevents them from being omitted.
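The traversal and failure-list bookkeeping can be sketched as below. The function `crawl_group`, the injected `fetch` callable and the simulated `fake_fetch` are assumptions for illustration; no real search engine is contacted, and a real crawler would distinguish error types instead of catching every exception.

```python
from typing import Callable, Dict, List, Tuple


def crawl_group(group: List[str],
                fetch: Callable[[str], str]) -> Tuple[Dict[str, str], List[str]]:
    """Crawl every keyword in the group; return fetched pages and the failure list."""
    pages: Dict[str, str] = {}
    failed: List[str] = []
    for keyword in group:
        try:
            pages[keyword] = fetch(keyword)
        except Exception:
            # Network, parsing or anti-crawler errors all end up in the failure list.
            failed.append(keyword)
    return pages, failed


def fake_fetch(keyword: str) -> str:
    """Simulated fetcher: one keyword is rejected to mimic an anti-crawler response."""
    if keyword == "blocked":
        raise RuntimeError("simulated anti-crawler response")
    return f"<html>results for {keyword}</html>"


pages, failed = crawl_group(["seo", "blocked", "queue"], fake_fetch)

# Group-level state: failure if any keyword failed, and one more retry recorded.
state = {"status": "failure" if failed else "success", "retries": 1 if failed else 0}
```

Passing the fetcher in as a parameter keeps the bookkeeping independent of how pages are actually downloaded.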
It should be noted that when a server fails to crawl a search engine result page corresponding to a keyword, the server may be allowed to sleep for a preset time and then re-execute a crawling task.
Preferably, after the keyword in the first keyword group that failed to be crawled is added to the failure list, the method further comprises: packaging the keywords in the failure list into a new keyword group; and adding the new keyword group to the task queue.
According to the embodiment of the application, after the keywords that failed to be crawled are added to the failure list, they are obtained from the failure list, packaged again into a new keyword group and stored in the task queue. The keywords that failed can therefore be crawled again, which prevents their data from being omitted and ensures that the search engine result pages corresponding to all keywords can be crawled.
Preferably, packaging the keywords in the failure list into a new keyword group includes: obtaining the number of retries of the keywords in the failure list; judging whether the number of retries of the keywords in the failure list is smaller than a preset value; and when the number of retries of the keywords in the failure list is judged to be smaller than the preset value, packaging the keywords in the failure list into a new keyword group.
In practice, the search engine result pages corresponding to some keywords may still not be obtained after many crawling attempts. To save system resources, the crawling task for such keywords may be stopped, and the corresponding search engine result pages may be obtained in another way, for example manually.
Specifically, in the present application the number of retries (i.e., the number of failures) of a keyword group is recorded in advance. The number of retries of the keywords in the failure list is obtained and compared with the preset value: if the number of retries is smaller than the preset value, the keywords in the failure list are packaged and stored in the task queue; otherwise, no packaging is performed.
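The retry check can be sketched as follows. Tracking the retry count per keyword in a dictionary, and the names `requeue_failures` and `max_retries`, are assumptions made for the example; the description above records the count at the level of the keyword group.

```python
from collections import deque
from typing import Deque, Dict, List


def requeue_failures(failed: List[str], retries: Dict[str, int], max_retries: int,
                     task_queue: Deque[List[str]]) -> List[str]:
    """Repackage failed keywords whose retry count is below max_retries.

    Returns the keywords that are given up on (retry count not below the limit).
    """
    retry_group = [kw for kw in failed if retries.get(kw, 0) < max_retries]
    abandoned = [kw for kw in failed if retries.get(kw, 0) >= max_retries]
    if retry_group:
        # The new keyword group re-enters the task queue like any other group.
        task_queue.append(retry_group)
    return abandoned


pending: Deque[List[str]] = deque()
given_up = requeue_failures(["a", "b"], {"a": 1, "b": 3}, max_retries=3,
                            task_queue=pending)
```

Here keyword `a` has been retried once and is queued again, while `b` has reached the limit of three and is returned for handling outside the crawler, for example manually.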
From the above description, it can be seen that the embodiment of the present application has the following characteristics. A large number of keywords is broken up into parts and distributed to machines (i.e., servers) with different IP addresses, which achieves distributed crawling and at the same time reduces the possibility of triggering an anti-crawler mechanism. When crawling fails (usually because an anti-crawler mechanism was triggered), crawling is attempted again by regrouping the failed keywords, so that data can be crawled for each keyword without omission. The keywords are grouped and added to a crawling queue, which may take the form of a distributed queue component or a database. The crawlers actively apply for tasks, so the crawling speed can be controlled according to actual conditions. Identical keywords may be merged to reduce the total amount of crawling.
It should be noted that the steps illustrated in the flowcharts of the figures may be performed in a computer system such as a set of computer-executable instructions and that, although a logical order is illustrated in the flowcharts, in some cases, the steps illustrated or described may be performed in an order different than presented herein.
According to another aspect of the embodiments of the present application, a web page crawling apparatus is provided, where the web page crawling apparatus may be configured to execute the web page crawling method according to the embodiments of the present application, and the web page crawling method according to the embodiments of the present application may also be executed by the web page crawling apparatus according to the embodiments of the present application.
Fig. 3 is a schematic diagram of a web page crawling apparatus according to an embodiment of the present application, as shown in fig. 3, the apparatus includes: an acquisition unit 10 and a crawling unit 20.
The acquisition unit 10 is configured to enable a plurality of servers to respectively acquire keyword groups from a task queue, where the task queue stores a plurality of keyword groups to be crawled, and each keyword group to be crawled includes a plurality of keywords.
Optionally, the plurality of servers includes a first server, and the acquisition unit 10 includes: a detection module, configured to enable the first server to detect whether a keyword group to be crawled exists in the task queue; a locking module, configured to lock the task queue when the first server detects that a keyword group to be crawled exists in the task queue, wherein the locked task queue can only be read by the first server; and an acquisition module, configured to enable the first server to acquire the keyword group from the locked task queue and release the task queue, wherein the released task queue can be read by any one of the plurality of servers.
And the crawling unit 20 is configured to enable the multiple servers to crawl a search engine result page corresponding to each keyword in the acquired keyword groups through respective web crawlers.
Optionally, the multiple servers include a first server, the keyword group obtained by the first server is a first keyword group, the web crawler of the first server is a first web crawler, and the crawling unit 20 includes: the crawling module is used for traversing the first keyword group and crawling a search engine result page corresponding to each keyword in the first keyword group through a first web crawler; the judgment module is used for judging whether the search engine result page corresponding to each keyword in the first keyword group crawled by the first web crawler succeeds; and the adding module is used for adding the keyword which fails to crawl in the first keyword group into the failure list when judging that the condition that the search engine result page corresponding to the keyword in the first keyword group fails to crawl exists.
In this embodiment of the application, the acquiring unit 10 enables a plurality of servers to respectively acquire keyword groups from a task queue, where a plurality of keyword groups to be crawled are stored in the task queue and each keyword group to be crawled includes a plurality of keywords, and the crawling unit 20 enables the plurality of servers to crawl, through their respective web crawlers, the search engine result page corresponding to each keyword in the acquired keyword groups. Because the search engine result pages corresponding to the keywords are crawled in a distributed manner by the plurality of servers, crawling efficiency can be improved and the possibility of triggering a search engine's anti-crawler strategy can be reduced. This solves the problem in the related art that crawling keyword search engine result pages through the web crawler of a single server is inefficient, and achieves the effect of improving the efficiency of crawling the search engine result pages corresponding to the keywords.
Preferably, the apparatus further includes: a packing unit, configured to pack the keywords in the failure list into a new keyword group; and an adding unit, configured to add the new keyword group to the task queue.
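A minimal Python sketch of the packing and adding units is given below, combined with the retry-count check recited in claim 4. The names `requeue_failures`, `retry_counts`, and `max_retries` are hypothetical, the default limit of 3 is an arbitrary illustrative value, and applying the check per keyword is one possible reading of the claim rather than a confirmed detail of the application.

```python
from collections import deque


def requeue_failures(failure_list, retry_counts, task_queue, max_retries=3):
    """Pack retryable failed keywords into a new group and enqueue it.

    retry_counts maps keyword -> number of retries so far (hypothetical bookkeeping).
    """
    # Keep only keywords whose retry count is below the preset value.
    new_group = [kw for kw in failure_list if retry_counts.get(kw, 0) < max_retries]
    # Record that these keywords are being retried once more.
    for kw in new_group:
        retry_counts[kw] = retry_counts.get(kw, 0) + 1
    # Adding unit: put the new keyword group back into the task queue.
    if new_group:
        task_queue.append(new_group)
    return new_group
```

For example, calling `requeue_failures(["a", "b"], {"b": 3}, queue)` with an empty `deque` as `queue` packs only `"a"` into the new group, because `"b"` has already reached the limit.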
The webpage crawling device includes a processor and a memory; the acquiring unit, the crawling unit, and the like are stored in the memory as program units, and the processor executes the program units stored in the memory to implement the corresponding functions.
The processor includes a kernel, and the kernel calls the corresponding program unit from the memory. One or more kernels may be provided, and the search engine result pages corresponding to the keywords are crawled by adjusting kernel parameters.
The memory may include volatile memory in a computer-readable medium, such as random access memory (RAM), and/or nonvolatile memory, such as read-only memory (ROM) or flash memory (flash RAM); the memory includes at least one memory chip.
The present application further provides a computer program product which, when executed on a data processing device, is adapted to execute program code initialized with the following method steps: a plurality of servers respectively acquire keyword groups from a task queue, where a plurality of keyword groups to be crawled are stored in the task queue and each keyword group to be crawled includes a plurality of keywords; and the plurality of servers respectively crawl, through their respective web crawlers, the search engine result page corresponding to each keyword in the acquired keyword groups.
The above-mentioned serial numbers of the embodiments of the present application are merely for description and do not represent the merits of the embodiments.
In the above embodiments of the present application, the descriptions of the respective embodiments have respective emphasis, and for parts that are not described in detail in a certain embodiment, reference may be made to related descriptions of other embodiments.
In the embodiments provided in the present application, it should be understood that the disclosed technology can be implemented in other ways. The above-described embodiments of the apparatus are merely illustrative, and for example, the division of the units may be a logical division, and in actual implementation, there may be another division, for example, multiple units or components may be combined or integrated into another system, or some features may be omitted, or not executed. In addition, the shown or discussed mutual coupling or direct coupling or communication connection may be an indirect coupling or communication connection through some interfaces, units or modules, and may be in an electrical or other form.
The units described as separate parts may or may not be physically separate, and parts displayed as units may or may not be physical units, may be located in one place, or may be distributed on a plurality of units. Some or all of the units can be selected according to actual needs to achieve the purpose of the solution of the embodiment.
In addition, functional units in the embodiments of the present application may be integrated into one processing unit, or each unit may exist alone physically, or two or more units are integrated into one unit. The integrated unit can be realized in a form of hardware, and can also be realized in a form of a software functional unit.
The integrated unit, if implemented in the form of a software functional unit and sold or used as a stand-alone product, may be stored in a computer-readable storage medium. Based on such understanding, the technical solution of the present application in essence, or the part of it that contributes to the prior art, or all or part of the technical solution, may be embodied in the form of a software product, which is stored in a storage medium and includes instructions for causing a computer device (which may be a personal computer, a server, or a network device) to execute all or part of the steps of the method according to the embodiments of the present application. The aforementioned storage medium includes: a USB flash drive, a read-only memory (ROM), a random access memory (RAM), a removable hard disk, a magnetic disk, an optical disk, and other media capable of storing program code.
The foregoing is only a preferred embodiment of the present application. It should be noted that those skilled in the art can make several improvements and modifications without departing from the principle of the present application, and these improvements and modifications should also be considered to fall within the protection scope of the present application.

Claims (8)

1. A method for crawling a web page, comprising:
a plurality of servers respectively obtaining keyword groups from a task queue, wherein a plurality of keyword groups to be crawled are stored in the task queue, and each keyword group to be crawled comprises a plurality of keywords; and
the plurality of servers respectively crawling, through respective web crawlers, a search engine result page corresponding to each keyword in the obtained keyword group;
wherein the plurality of servers comprises a first server, the plurality of servers respectively obtaining keyword groups from the task queue comprises the first server obtaining a keyword group from the task queue, and the first server obtaining the keyword group from the task queue comprises: the first server detecting whether a keyword group to be crawled exists in the task queue; locking the task queue when the first server detects that a keyword group to be crawled exists in the task queue, wherein the locked task queue can be read only by the first server; and the first server obtaining the keyword group from the locked task queue and releasing the task queue, wherein the released task queue can be read by any one of the plurality of servers.
2. The method of claim 1, wherein the plurality of servers comprises a first server, the keyword group obtained by the first server is a first keyword group, the web crawler of the first server is a first web crawler, the plurality of servers respectively crawling, through respective web crawlers, the search engine result page corresponding to each keyword in the obtained keyword group comprises the first server crawling, through the first web crawler, the search engine result page corresponding to each keyword in the first keyword group, and the first server crawling, through the first web crawler, the search engine result page corresponding to each keyword in the first keyword group comprises:
traversing the first keyword group, and crawling a search engine result page corresponding to each keyword in the first keyword group through the first web crawler;
judging whether the first web crawler successfully crawls the search engine result page corresponding to each keyword in the first keyword group; and
when it is judged that crawling of the search engine result page corresponding to a keyword in the first keyword group has failed, adding the keyword that failed to be crawled in the first keyword group to a failure list.
3. The method of claim 2, wherein after adding the keyword in the first keyword group that failed to crawl to a failure list, the method further comprises:
packing the keywords in the failure list into a new keyword group; and
adding the new keyword group to the task queue.
4. The method of claim 3, wherein packaging the keywords in the failure list as a new keyword group comprises:
obtaining the retry times of the keywords in the failure list;
judging whether the retry times of the keywords in the failure list are smaller than a preset value; and
when the retry times of the keywords in the failure list are judged to be smaller than the preset value, packaging the keywords in the failure list into a new keyword group.
5. The method of claim 1, wherein before the plurality of servers respectively obtain the keyword groups from the task queue, the method further comprises:
grouping a plurality of keywords according to a preset rule to obtain a plurality of keyword groups; and
storing the plurality of keyword groups in the task queue according to priority.
6. A web page crawling apparatus, comprising:
an acquisition unit, configured to enable a plurality of servers to respectively obtain keyword groups from a task queue, wherein a plurality of keyword groups to be crawled are stored in the task queue, and each keyword group to be crawled comprises a plurality of keywords; and
a crawling unit, configured to enable the plurality of servers to crawl, through respective web crawlers, a search engine result page corresponding to each keyword in the obtained keyword groups;
wherein the plurality of servers comprises a first server, and the acquisition unit comprises:
a detection module, configured to enable the first server to detect whether a keyword group to be crawled exists in the task queue;
a locking module, configured to lock the task queue when the first server detects that a keyword group to be crawled exists in the task queue, wherein the locked task queue can be read only by the first server; and
an obtaining module, configured to enable the first server to obtain the keyword group from the locked task queue and release the task queue, wherein the released task queue can be read by any one of the plurality of servers.
7. The apparatus of claim 6, wherein the plurality of servers includes a first server, the keyword group obtained by the first server is a first keyword group, the web crawler of the first server is a first web crawler, and the crawling unit includes:
a crawling module, configured to traverse the first keyword group and crawl, through the first web crawler, a search engine result page corresponding to each keyword in the first keyword group;
a judging module, configured to judge whether the first web crawler successfully crawls the search engine result page corresponding to each keyword in the first keyword group; and
an adding module, configured to add, when it is judged that crawling of the search engine result page corresponding to a keyword in the first keyword group has failed, the keyword that failed to be crawled in the first keyword group to a failure list.
8. The apparatus of claim 7, further comprising:
a packing unit, configured to pack the keywords in the failure list into a new keyword group; and
an adding unit, configured to add the new keyword group to the task queue.
CN201510729544.6A 2015-10-30 2015-10-30 Webpage crawling method and device Active CN106649362B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201510729544.6A CN106649362B (en) 2015-10-30 2015-10-30 Webpage crawling method and device

Publications (2)

Publication Number Publication Date
CN106649362A (en) 2017-05-10
CN106649362B (en) 2020-02-07

Family

ID=58809462

Country Status (1)

Country Link
CN (1) CN106649362B (en)

Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101334784A (en) * 2008-07-30 2008-12-31 施章祖 Computer auxiliary report and knowledge base generation method
CN101788988A (en) * 2009-01-22 2010-07-28 蔡亮华 Information extraction method
CN101916291A (en) * 2010-08-26 2010-12-15 北京大学 Method for crawling eDonkey network shared file and client information
CN103605764A (en) * 2013-11-26 2014-02-26 Tcl集团股份有限公司 Web crawler system and web crawler multitask executing and scheduling method
CN104077402A (en) * 2014-07-04 2014-10-01 用友软件股份有限公司 Data processing method and data processing system
CN104199830A (en) * 2014-07-31 2014-12-10 渠成 Search engine optimization big data management platform
WO2015039165A1 (en) * 2013-09-19 2015-03-26 Longtail Ux Pty Ltd Improvements in website traffic optimization

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
CB02 Change of applicant information

Address after: 100083 No. 401, 4th Floor, Haitai Building, 229 North Fourth Ring Road, Haidian District, Beijing

Applicant after: Beijing Guoshuang Technology Co.,Ltd.

Address before: 100086 Cuigong Hotel, 76 Zhichun Road, Shuangyushu District, Haidian District, Beijing

Applicant before: Beijing Guoshuang Technology Co.,Ltd.

GR01 Patent grant