CN106528567B - Method and device for updating web crawler cluster information - Google Patents

Method and device for updating web crawler cluster information

Info

Publication number
CN106528567B
Authority
CN
China
Prior art keywords
target
link
web crawlers
local
broadcast
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201510579940.5A
Other languages
Chinese (zh)
Other versions
CN106528567A (en
Inventor
崔志伸
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Gridsum Technology Co Ltd
Original Assignee
Beijing Gridsum Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Gridsum Technology Co Ltd
Priority to CN201510579940.5A
Publication of CN106528567A
Application granted
Publication of CN106528567B
Legal status: Active

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90 Details of database functions independent of the retrieved data types
    • G06F16/95 Retrieval from the web
    • G06F16/951 Indexing; Web crawling techniques

Abstract

This application discloses a method and device for updating web crawler cluster information. Each web crawler in the web crawler cluster is equipped with a local detector. The method comprises: a target local detector queries, according to a message sent by its corresponding web crawler, whether a target crawl link exists in the target local detector, wherein the message carries the target crawl link; when the query finds that the target crawl link does not exist, the target local detector saves the target crawl link and sends a broadcast carrying the target crawl link to the other local detectors, so that the other local detectors update their crawl links according to the broadcast. The application solves the technical problem in the related art that web crawlers crawl with relatively low efficiency.

Description

Method and device for updating web crawler cluster information
Technical field
This application relates to the field of internet crawlers, and in particular to a method and device for updating web crawler cluster information.
Background art
When crawling websites, a web crawler cluster needs to filter out duplicate links to prevent the same page from being crawled repeatedly. While a web crawler crawls pages, the links it has crawled are stored in a detector used to filter duplicate pages. So that every crawler in the cluster holds, as far as possible, an identical detector at all times and duplicate pages are not crawled again, the detectors need to be updated synchronously.
Existing schemes deploy a single unified detector in the cluster, and all web crawlers access that same detector to exclude duplicate pages. Under this scheme, however, all web crawlers in the cluster compete for the same detector resource: every time a web crawler crawls a page, it must ask the detector whether the link has already been crawled, so the crawling efficiency of the web crawlers is relatively low.
No effective solution to the above problem has yet been proposed.
Summary of the invention
The embodiments of this application provide a method and device for updating web crawler cluster information, to at least solve the technical problem in the related art that web crawlers crawl with relatively low efficiency.
According to one aspect of the embodiments of this application, a method for updating web crawler cluster information is provided. Each web crawler in the web crawler cluster is equipped with a local detector. A target local detector queries, according to a message sent by its corresponding web crawler, whether a target crawl link exists in the target local detector, wherein the message carries the target crawl link; when the query finds that the target crawl link does not exist, the target local detector saves the target crawl link and sends a broadcast carrying the target crawl link to the other local detectors, so that the other local detectors update their crawl links according to the broadcast.
According to another aspect of the embodiments of this application, a device for updating web crawler cluster information is also provided. Each web crawler in the web crawler cluster is equipped with a local detector. The device comprises: a query unit, configured to query, according to a message sent by the web crawler corresponding to a target local detector, whether a target crawl link exists in the target local detector, wherein the message carries the target crawl link; and a broadcast unit, configured to, when the query finds that the target crawl link does not exist, save the target crawl link and send a broadcast carrying the target crawl link to the other local detectors, so that the other local detectors update their crawl links according to the broadcast.
In the embodiments of this application, a target local detector queries, according to a message sent by its corresponding web crawler, whether a target crawl link exists in the target local detector, wherein the message carries the target crawl link; when the query finds that the target crawl link does not exist, the target crawl link is saved and a broadcast carrying it is sent to the other local detectors, so that the other local detectors update their crawl links according to the broadcast. In this way, each web crawler filters duplicate target crawl links through its own local detector, which improves crawling efficiency. Meanwhile, each local detector receives, by broadcast, synchronized updates about links that have been crawled, and can also send such updates by broadcast, so the local detectors in the web crawler cluster hold consistent information. This ensures that different crawlers do not repeatedly crawl the same link: when multiple crawlers perform crawling tasks at the same time, both relatively high crawling efficiency and relatively high accuracy can be ensured, thereby solving the technical problem in the related art that web crawlers crawl with relatively low efficiency.
Brief description of the drawings
The drawings described here provide a further understanding of this application and form part of it. The illustrative embodiments of this application and their descriptions are used to explain the application and do not unduly limit it. In the drawings:
Fig. 1 is a flowchart of the method for updating web crawler cluster information according to an embodiment of this application;
Fig. 2 is a schematic diagram of an optional web crawler cluster structure according to an embodiment of this application;
Fig. 3 is a schematic diagram of the device for updating web crawler cluster information according to an embodiment of this application.
Detailed description
To help those skilled in the art better understand the solution of this application, the technical solutions in the embodiments are described clearly and completely below with reference to the accompanying drawings. Obviously, the described embodiments are only some of the embodiments of this application, not all of them. All other embodiments obtained by a person of ordinary skill in the art based on the embodiments in this application without creative effort shall fall within the scope of protection of this application.
It should be noted that the terms "first", "second" and the like in the description, claims and drawings of this application are used to distinguish similar objects and not to describe a particular order or sequence. It should be understood that data so used are interchangeable where appropriate, so that the embodiments described here can be implemented in orders other than those illustrated or described. In addition, the terms "comprise" and "have" and any variations of them are intended to cover non-exclusive inclusion; for example, a process, method, system, product or device that comprises a series of steps or units is not necessarily limited to the steps or units expressly listed, but may include other steps or units that are not expressly listed or that are inherent to the process, method, product or device.
According to an embodiment of this application, a method embodiment of the method for updating web crawler cluster information is provided. It should be noted that the steps shown in the flowchart of the drawings can be executed in a computer system, for example as a set of computer-executable instructions, and that although a logical order is shown in the flowchart, in some cases the steps shown or described may be executed in an order different from the one given here.
Fig. 1 is a flowchart of the method for updating web crawler cluster information according to an embodiment of this application. Each web crawler in the web crawler cluster is equipped with a local detector. As shown in Fig. 1, the method comprises the following steps:
Step S102: the target local detector queries, according to a message sent by its corresponding web crawler, whether a target crawl link exists in the target local detector, wherein the message carries the target crawl link.
Step S104: when the query finds that the target crawl link does not exist, the target local detector saves the target crawl link and sends a broadcast carrying the target crawl link to the other local detectors, so that the other local detectors update their crawl links according to the broadcast.
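Steps S102 and S104 can be illustrated with a minimal Python sketch. It is not the patented implementation: the class name LocalDetector, the methods handle_message and receive_broadcast, the in-memory set, and the direct peer list used in place of a real broadcast channel are all assumptions made for illustration.

```python
class LocalDetector:
    """Per-crawler store of crawled links (hypothetical name and API)."""

    def __init__(self, name):
        self.name = name
        self.seen = set()   # links known to have been crawled
        self.peers = []     # the other local detectors in the cluster

    def receive_broadcast(self, link):
        # Another detector reported this link; record it locally.
        self.seen.add(link)

    def handle_message(self, link):
        """S102: look the link up. S104: if absent, save it and broadcast it.

        Returns True when the link was not yet stored in this detector.
        """
        if link in self.seen:
            return False
        self.seen.add(link)
        for peer in self.peers:
            peer.receive_broadcast(link)
        return True


def connect(detectors):
    # Assumption: every detector can reach every other detector directly.
    for d in detectors:
        d.peers = [p for p in detectors if p is not d]


a, b, c = LocalDetector("a"), LocalDetector("b"), LocalDetector("c")
connect([a, b, c])
first = a.handle_message("www.abcdefg.com")   # link is new to detector a
second = b.handle_message("www.abcdefg.com")  # b already learned it by broadcast
```

In this sketch each lookup touches only the crawler's own detector, which is the property the embodiment relies on to avoid contention for a single shared detector.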
Each web crawler in the web crawler cluster is equipped with a local detector, and the target local detector may be the local detector corresponding to any web crawler in the cluster. When the target local detector finds that a link has not been crawled, the corresponding web crawler can crawl that link, and the target local detector sends, by broadcast, a message that the link has been crawled. The other local detectors that receive the broadcast store the link, so that the web crawler corresponding to any detector storing the link filters it out when crawling and avoids crawling the same link again. Because the local detector of every web crawler in the cluster can receive the broadcast, the local detectors can synchronously update the information they store locally. In this embodiment, broadcasting lets multiple local detectors update their information in step, so whichever local detector the cluster uses to filter duplicate links, duplicates are filtered out accurately. Because each web crawler has its own local detector and uses it to check for duplicate links, the crawlers do not compete for the resources of a single detector; this improves the efficiency of filtering duplicate links and therefore the crawling efficiency of the web crawlers. Because the crawled links are stored in every local detector in the cluster, each crawler filters duplicate links accurately through its own local detector. Filtering accuracy is thus improved together with crawling efficiency, achieving accurate and efficient crawling.
Optionally, after the target local detector queries, according to the message sent by its corresponding web crawler, whether the target crawl link exists in the target local detector, the method further comprises: when the query finds that the target crawl link does not exist, the target local detector sends an instruction permitting crawling to its corresponding web crawler, so that the web crawler crawls the target crawl link; when the query finds that the target crawl link exists, the target local detector sends an instruction to abandon crawling to its corresponding web crawler, so that the web crawler abandons crawling the target crawl link.
The target local detector is queried to determine whether it stores the target crawl link. If the link is found, it has already been crawled and does not need to be crawled again, so the corresponding web crawler is notified not to crawl it. If the link is not found, it has not been crawled and may be crawled, so the corresponding web crawler is notified to crawl it. Because the query is made before crawling, the same target crawl link is not crawled repeatedly. Because the crawl link information in every local detector of the cluster is synchronized, each web crawler can ask its own local detector whether the target crawl link has been crawled. This avoids contention for a single detector and improves query efficiency, and therefore improves crawling efficiency overall.
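The allow and abandon instructions described above can be sketched as a small decision function. The names Instruction, query and crawl_if_allowed are hypothetical, the list fetched merely stands in for an actual page download, and the broadcast step is deliberately left out so that only the instruction logic is shown.

```python
from enum import Enum


class Instruction(Enum):
    ALLOW = "allow"      # the crawler may crawl the target crawl link
    ABANDON = "abandon"  # the link is already stored; skip it


def query(stored_links, target_link):
    """Return the instruction a local detector would send to its crawler."""
    if target_link in stored_links:
        return Instruction.ABANDON
    stored_links.add(target_link)  # save the link before allowing the crawl
    return Instruction.ALLOW


def crawl_if_allowed(stored_links, target_link, fetched):
    # `fetched` records which links the crawler actually crawled.
    instruction = query(stored_links, target_link)
    if instruction is Instruction.ALLOW:
        fetched.append(target_link)
    return instruction


stored = set()
fetched = []
r1 = crawl_if_allowed(stored, "www.abcdefg.com", fetched)
r2 = crawl_if_allowed(stored, "www.abcdefg.com", fetched)
```

The second request for the same link yields the abandon instruction, so the link appears only once in fetched.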
As shown in Fig. 2, web crawler A is to crawl the target crawl link www.abcdefg.com. Web crawler A looks the target link up in local detector a. If the target link is not found in local detector a, web crawler A crawls the target crawl link www.abcdefg.com. If the target link is found in local detector a, the target link is determined to have been crawled already and web crawler A abandons crawling it, which avoids crawling the same link repeatedly.
Specifically, the web crawler cluster further includes a broadcast module. The target local detector sending, to the other local detectors, the broadcast carrying the target crawl link includes: the target local detector sends crawl information carrying the target crawl link to the broadcast module, so that the broadcast module generates the broadcast from the crawl information and sends the broadcast to the other local detectors that have subscribed to it. A local detector sends broadcasts through the broadcast module in the web crawler cluster and also receives broadcasts sent by the broadcast module, so that all local detectors in the cluster are updated synchronously. The other web crawlers in the cluster can receive the broadcast and record the target crawl link, so that the links already crawled are stored in the local detector of each web crawler.
For example, as shown in Fig. 2, the web crawler cluster includes web crawler A, web crawler B, web crawler C, ..., web crawler N, whose corresponding local detectors are local detector a, local detector b, local detector c, ..., local detector n. The cluster further includes broadcast module X, and every web crawler subscribed to the broadcast can listen to the broadcasts sent by broadcast module X. Web crawler A is to crawl the target crawl link www.abcdefg.com and looks the target link up in local detector a. The target link is not found in local detector a, so web crawler A crawls www.abcdefg.com. Local detector a sends broadcast module X the information that the target crawl link www.abcdefg.com has been crawled, and broadcast module X generates a broadcast carrying www.abcdefg.com. Local detector b, local detector c, ..., local detector n, which are subscribed to the broadcast, store the www.abcdefg.com carried by the broadcast locally. When web crawler B later needs to crawl the target crawl link www.abcdefg.com, local detector b finds the target crawl link, so web crawler B does not crawl www.abcdefg.com. After web crawler B crawls another link that is not stored in local detector b, crawl information can likewise be sent; the process is the same as described for local detector a and is not repeated here.
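The Fig. 2 example can be traced with a small in-process publish/subscribe sketch. BroadcastModule, subscribe, publish, check and on_broadcast are hypothetical names, the module is a plain Python object rather than a network service, and the second link www.example.org/page is invented purely to show crawler B reporting a link of its own.

```python
class BroadcastModule:
    """Stand-in for broadcast module X (hypothetical API)."""

    def __init__(self):
        self.subscribers = []

    def subscribe(self, detector):
        self.subscribers.append(detector)

    def publish(self, sender, link):
        # Deliver the crawled link to every subscriber except the sender.
        for detector in self.subscribers:
            if detector is not sender:
                detector.on_broadcast(link)


class LocalDetector:
    def __init__(self, name, module):
        self.name = name
        self.links = set()
        self.module = module
        module.subscribe(self)

    def on_broadcast(self, link):
        self.links.add(link)

    def check(self, link):
        """Return True if the crawler may crawl the link."""
        if link in self.links:
            return False
        self.links.add(link)
        self.module.publish(self, link)  # send crawl information to module X
        return True


x = BroadcastModule()
a, b, c = (LocalDetector(n, x) for n in "abc")
a_may_crawl = a.check("www.abcdefg.com")     # not found in a: crawl it
b_may_crawl = b.check("www.abcdefg.com")     # found in b after the broadcast
b_other = b.check("www.example.org/page")    # hypothetical link new to b
```

After these calls all three detectors hold both links, which mirrors the consistent state the example describes.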
Optionally, the target local detector sending the broadcast carrying the target crawl link to the other local detectors, so that the other local detectors update their crawl links according to the broadcast, includes: the local detector sends the broadcast carrying the target crawl link to the other local detectors, so that the other local detectors receive the broadcast and save the target crawl link it carries.
Because each crawler in the web crawler cluster corresponds to one local detector, the cluster can be expanded by adding a crawler together with its local detector, or changed by removing a crawler together with its local detector. When a crawler and its local detector are added, the new local detector only needs to subscribe to the broadcasts of the broadcast module in order to receive the update information the module sends, which keeps the updates of the multiple local detectors synchronized. The information stored in the local detectors therefore stays consistent, and neither adding nor removing a local detector affects how the remaining local detectors in the cluster filter duplicate links or the accuracy with which the web crawlers crawl links. Because each crawler has its own local detector, adding or removing a crawler and its local detector as a pair does not reduce the crawling efficiency of the other crawlers.
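Adding and removing a crawler with its local detector can be sketched as subscribing to and unsubscribing from the broadcast module. The subscribe, unsubscribe and publish methods are assumed names, each detector is reduced to a plain set, and www.example.org/page is an invented link. The text does not say how a newly added detector obtains links broadcast before it subscribed, so the sketch covers only updates sent after subscription.

```python
class BroadcastModule:
    """Hypothetical broadcast module keyed by detector name."""

    def __init__(self):
        self._subscribers = {}  # detector name -> callback taking a link

    def subscribe(self, name, callback):
        self._subscribers[name] = callback

    def unsubscribe(self, name):
        self._subscribers.pop(name, None)

    def publish(self, sender, link):
        for name, callback in list(self._subscribers.items()):
            if name != sender:
                callback(link)


# Each local detector is modelled as a set of crawled links.
stores = {"a": set(), "b": set()}
x = BroadcastModule()
for name, store in stores.items():
    x.subscribe(name, store.add)

# Expand the cluster: add crawler C with local detector c.
stores["c"] = set()
x.subscribe("c", stores["c"].add)
stores["a"].add("www.abcdefg.com")
x.publish("a", "www.abcdefg.com")       # reaches b and the new detector c

# Shrink the cluster: remove crawler B with local detector b.
x.unsubscribe("b")
stores["c"].add("www.example.org/page")
x.publish("c", "www.example.org/page")  # reaches a only
```

The remaining detectors a and c keep exchanging updates after b leaves, without any change to their own code or state.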
In the above embodiment, each web crawler filters duplicate target crawl links through its own local detector, which improves crawling efficiency. Meanwhile, each local detector receives, by broadcast, synchronized updates about links that have been crawled, and can also send such updates by broadcast, so the local detectors in the web crawler cluster hold consistent information. This ensures that different crawlers do not repeatedly crawl the same link: when multiple crawlers perform crawling tasks at the same time, both relatively high crawling efficiency and relatively high accuracy can be ensured.
According to an embodiment of this application, a device embodiment of the device for updating web crawler cluster information is also provided. Each web crawler in the web crawler cluster is equipped with a local detector. The device for updating web crawler cluster information can execute the above method for updating web crawler cluster information, and the above method can also be executed by this device.
As shown in Fig. 3, the device for updating web crawler cluster information comprises: a query unit 10, configured to query, according to a message sent by the web crawler corresponding to a target local detector, whether a target crawl link exists in the target local detector, wherein the message carries the target crawl link; and a broadcast unit 30, configured to, when the query finds that the target crawl link does not exist, save the target crawl link and send a broadcast carrying the target crawl link to the other local detectors, so that the other local detectors update their crawl links according to the broadcast.
Each web crawler in the web crawler cluster is equipped with a local detector. When a local detector determines that a link has not been crawled, the link can be crawled, and a message that the link has been crawled is sent by broadcast. The local detectors that receive the broadcast store the link, so that the web crawler corresponding to any detector storing the link filters it out when crawling and avoids crawling the same link again. Because the local detector of every web crawler in the cluster can receive the broadcast, the local detectors can synchronously update the information they store locally. In this embodiment, broadcasting lets multiple local detectors update their information in step, so whichever local detector a web crawler uses to filter duplicate links, duplicates are filtered out accurately. Because each web crawler has its own local detector and uses it to check for duplicate links, the crawlers do not compete for the resources of a single detector; this improves the efficiency of filtering duplicate links and therefore the crawling efficiency of the web crawlers. Because the crawled links are stored in every local detector in the cluster, each crawler filters duplicate links accurately through its own local detector. Filtering accuracy is thus improved together with crawling efficiency, achieving accurate and efficient crawling.
Optionally, the device further comprises: a first sending unit, configured to, after the target local detector queries, according to the message sent by its corresponding web crawler, whether the target crawl link exists in the target local detector, and when the query finds that the target crawl link does not exist, send an instruction permitting crawling to the web crawler corresponding to the target local detector, so that the web crawler crawls the target crawl link; and a second sending unit, configured to, when the query finds that the target crawl link exists, send an instruction to abandon crawling to the web crawler corresponding to the target local detector, so that the web crawler abandons crawling the target crawl link.
The target local detector is queried to determine whether it stores the target crawl link. If the link is found, it has already been crawled and does not need to be crawled again, so the corresponding web crawler is notified not to crawl it. If the link is not found, it has not been crawled and may be crawled, so the corresponding web crawler is notified to crawl it. Because the query is made before crawling, the same target crawl link is not crawled repeatedly. Because the crawl link information in every local detector of the cluster is synchronized, each web crawler can ask its own local detector whether the target crawl link has been crawled. This avoids contention for a single detector and improves query efficiency, and therefore improves crawling efficiency overall.
As shown in Fig. 2, web crawler A is to crawl the target crawl link www.abcdefg.com. Web crawler A looks the target link up in local detector a. If the target link is not found in local detector a, web crawler A crawls the target crawl link www.abcdefg.com. If the target link is found in local detector a, the target link is determined to have been crawled already and web crawler A abandons crawling it, which avoids crawling the same link repeatedly.
Specifically, the web crawler cluster further includes a broadcast module, and the broadcast unit comprises: a sending module, configured to send crawl information carrying the target crawl link to the broadcast module, so that the broadcast module generates the broadcast from the crawl information and sends the broadcast to the other local detectors that have subscribed to it.
A local detector sends broadcasts through the broadcast module in the web crawler cluster and also receives broadcasts sent by the broadcast module, so that all local detectors in the cluster are updated synchronously. The other web crawlers in the cluster can receive the broadcast and record the target crawl link, so that the links already crawled are stored in the local detector of each web crawler.
For example, as shown in Fig. 2, the web crawler cluster includes web crawler A, web crawler B, web crawler C, ..., web crawler N, whose corresponding local detectors are local detector a, local detector b, local detector c, ..., local detector n. The cluster further includes broadcast module X, and every web crawler subscribed to the broadcast can listen to the broadcasts sent by broadcast module X. Web crawler A is to crawl the target crawl link www.abcdefg.com and looks the target link up in local detector a. The target link is not found in local detector a, so web crawler A crawls www.abcdefg.com. Local detector a sends broadcast module X the information that the target crawl link www.abcdefg.com has been crawled, and broadcast module X generates a broadcast carrying www.abcdefg.com. Local detector b, local detector c, ..., local detector n, which are subscribed to the broadcast, store the www.abcdefg.com carried by the broadcast locally. When web crawler B later needs to crawl the target crawl link www.abcdefg.com, local detector b finds the target crawl link, so web crawler B does not crawl www.abcdefg.com. After web crawler B crawls another link that is not stored in local detector b, crawl information can likewise be sent; the process is the same as described for local detector a and is not repeated here.
Optionally, the broadcast unit is further configured to send the broadcast carrying the target crawl link to the other local detectors, so that the other local detectors receive the broadcast and save the target crawl link it carries.
Because each crawler in the web crawler cluster corresponds to one local detector, the cluster can be expanded by adding a crawler together with its local detector, or changed by removing a crawler together with its local detector. When a crawler and its local detector are added, the new local detector only needs to subscribe to the broadcasts of the broadcast module in order to receive the update information the module sends, which keeps the updates of the multiple local detectors synchronized. The information stored in the local detectors therefore stays consistent, and neither adding nor removing a local detector affects how the remaining local detectors in the cluster filter duplicate links or the accuracy with which the web crawlers crawl links. Because each crawler has its own local detector, adding or removing a crawler and its local detector as a pair does not reduce the crawling efficiency of the other crawlers.
In the above embodiment, each web crawler filters duplicate target crawl links through its own local detector, which improves crawling efficiency. Meanwhile, each local detector receives, by broadcast, synchronized updates about links that have been crawled, and can also send such updates by broadcast, so the local detectors in the web crawler cluster hold consistent information. This ensures that different crawlers do not repeatedly crawl the same link: when multiple crawlers perform crawling tasks at the same time, both relatively high crawling efficiency and relatively high accuracy can be ensured.
Above-mentioned the embodiment of the present application serial number is for illustration only, does not represent the advantages or disadvantages of the embodiments.
In above-described embodiment of the application, all emphasizes particularly on different fields to the description of each embodiment, do not have in some embodiment The part of detailed description, reference can be made to the related descriptions of other embodiments.
In several embodiments provided herein, it should be understood that disclosed technology contents can pass through others Mode is realized.Wherein, the apparatus embodiments described above are merely exemplary, such as the division of unit, can be one kind Logical function partition, there may be another division manner in actual implementation, such as multiple units or components can combine or can To be integrated into another system, or some features can be ignored or not executed.Another point, shown or discussed is mutual Coupling, direct-coupling or communication connection can be through some interfaces, the indirect coupling or communication connection of unit or module, It can be electrical or other forms.
The unit as illustrated by the separation member may or may not be physically separated, aobvious as unit The component shown may or may not be physical unit, it can and it is in one place, or may be distributed over multiple On unit.It can some or all of the units may be selected to achieve the purpose of the solution of this embodiment according to the actual needs.
It, can also be in addition, each functional unit in each embodiment of the application can integrate in one processing unit It is that each unit physically exists alone, can also be integrated in one unit with two or more units.Above-mentioned integrated list Member both can take the form of hardware realization, can also realize in the form of software functional units.
If the integrated unit is realized in the form of SFU software functional unit and sells or use as independent product When, it can store in a computer readable storage medium.Based on this understanding, the technical solution of the application is substantially The all or part of the part that contributes to existing technology or the technical solution can be in the form of software products in other words It embodies, which is stored in a storage medium, including some instructions are used so that a computer Equipment (can for personal computer, server or network equipment etc.) execute each embodiment the method for the application whole or Part steps.And storage medium above-mentioned includes: that USB flash disk, read-only memory (ROM, Read-Only Memory), arbitrary access are deposited Reservoir (RAM, Random Access Memory), mobile hard disk, magnetic or disk etc. be various to can store program code Medium.
The above is only the preferred embodiment of this application. It should be noted that those of ordinary skill in the art may make several improvements and modifications without departing from the principles of this application, and these improvements and modifications shall also be regarded as falling within the protection scope of this application.

Claims (8)

1. A method for updating web crawler cluster information, characterized in that each web crawler in the web crawler cluster is provided with a local detector, the method comprising:
A target local detector queries, according to a message sent by its corresponding web crawler, whether a target crawl link exists in the target local detector, wherein the message carries the target crawl link;
When the query finds that the target crawl link does not exist, the target local detector saves the target crawl link and sends a broadcast carrying the target crawl link to the other local detectors, so that the other local detectors update their crawled links according to the broadcast;
Wherein the crawled links are stored in each local detector of the web crawler cluster.
2. The method according to claim 1, characterized in that, after the target local detector queries, according to the message sent by its corresponding web crawler, whether the target crawl link exists in the target local detector, the method further comprises:
When the query finds that the target crawl link does not exist, the target local detector sends an instruction allowing the crawl to its corresponding web crawler, so that the web crawler crawls the target crawl link;
When the query finds that the target crawl link exists, the target local detector sends an instruction to abandon the crawl to its corresponding web crawler, so that the web crawler abandons crawling the target crawl link.
3. The method according to claim 1, characterized in that the web crawler cluster further comprises a broadcast module, and the target local detector sending the broadcast carrying the target crawl link to the other local detectors comprises:
The target local detector sends crawl information carrying the target crawl link to the broadcast module, so that the broadcast module generates the broadcast according to the crawl information and delivers the broadcast to the other local detectors that have subscribed to the broadcast.
4. The method according to claim 1, characterized in that the target local detector sending the broadcast carrying the target crawl link to the other local detectors, so that the other local detectors update their crawled links according to the broadcast, comprises:
The target local detector sends the broadcast carrying the target crawl link to the other local detectors, so that the other local detectors receive the broadcast and save the target crawl link carried by the broadcast.
5. A device for updating web crawler cluster information, characterized in that each web crawler in the web crawler cluster is provided with a local detector, the device comprising:
A query unit, configured to query, according to a message sent by the web crawler corresponding to a target local detector, whether a target crawl link exists in the target local detector, wherein the message carries the target crawl link;
A broadcast unit, configured to, when the query finds that the target crawl link does not exist, save the target crawl link and send a broadcast carrying the target crawl link to the other local detectors, so that the other local detectors update their crawled links according to the broadcast;
Wherein the crawled links are stored in each local detector of the web crawler cluster.
6. The device according to claim 5, characterized in that the device further comprises:
A first sending unit, configured to, after the target local detector queries, according to the message sent by its corresponding web crawler, whether the target crawl link exists in the target local detector, and when the query finds that the target crawl link does not exist, send an instruction allowing the crawl to the web crawler corresponding to the target local detector, so that the web crawler crawls the target crawl link;
A second sending unit, configured to, when the query finds that the target crawl link exists, send an instruction to abandon the crawl to the web crawler corresponding to the target local detector, so that the web crawler abandons crawling the target crawl link.
7. The device according to claim 5, characterized in that the web crawler cluster further comprises a broadcast module, and the broadcast unit comprises:
A sending module, configured to send crawl information carrying the target crawl link to the broadcast module, so that the broadcast module generates the broadcast according to the crawl information and delivers the broadcast to the other local detectors that have subscribed to the broadcast.
8. The device according to claim 7, characterized in that the broadcast unit is further configured to send the broadcast carrying the target crawl link to the other local detectors, so that the other local detectors receive the broadcast and save the target crawl link carried by the broadcast.
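
The flow recited in method claims 1 to 4 can be illustrated with a minimal Python sketch. This is not the patented implementation: the class names `LocalDetector` and `BroadcastModule`, the methods `check`, `on_broadcast`, `subscribe` and `publish`, the in-memory set of crawled links, and the example URL are all assumptions introduced for illustration. The claims do not specify a transport, so broadcasts are modelled here as direct in-process calls.

```python
from typing import List, Set


class BroadcastModule:
    """Hypothetical in-process stand-in for the broadcast module of claim 3."""

    def __init__(self) -> None:
        self._subscribers: List["LocalDetector"] = []

    def subscribe(self, detector: "LocalDetector") -> None:
        # A local detector registers to receive broadcasts.
        self._subscribers.append(detector)

    def publish(self, sender: "LocalDetector", link: str) -> None:
        # Deliver the broadcast to every subscribed detector except the sender.
        for detector in self._subscribers:
            if detector is not sender:
                detector.on_broadcast(link)


class LocalDetector:
    """Hypothetical local detector paired with one web crawler."""

    def __init__(self, hub: BroadcastModule) -> None:
        self.crawled: Set[str] = set()  # crawled links known to this detector
        self._hub = hub
        hub.subscribe(self)

    def check(self, target_link: str) -> bool:
        """Return True to allow the crawl, False to abandon it (claim 2)."""
        if target_link in self.crawled:
            return False
        # Link not seen before: save it, then notify the other detectors (claim 1).
        self.crawled.add(target_link)
        self._hub.publish(self, target_link)
        return True

    def on_broadcast(self, link: str) -> None:
        # Save the link carried by a broadcast from another detector (claim 4).
        self.crawled.add(link)


hub = BroadcastModule()
detector_a = LocalDetector(hub)
detector_b = LocalDetector(hub)

allowed_first = detector_a.check("http://example.com/page")
allowed_second = detector_b.check("http://example.com/page")
print(allowed_first, allowed_second)
```

In this sketch the crawler attached to `detector_a` is allowed to crawl the link, while the crawler attached to `detector_b` is told to abandon it, because `detector_b` already saved the link when it received the broadcast. Each detector answers from its own local set, so no lookup against a central store is needed for the query itself.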
CN201510579940.5A 2015-09-11 2015-09-11 The update method and device of web crawlers cluster information Active CN106528567B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201510579940.5A CN106528567B (en) 2015-09-11 2015-09-11 The update method and device of web crawlers cluster information

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201510579940.5A CN106528567B (en) 2015-09-11 2015-09-11 The update method and device of web crawlers cluster information

Publications (2)

Publication Number Publication Date
CN106528567A CN106528567A (en) 2017-03-22
CN106528567B true CN106528567B (en) 2019-11-12

Family

ID=58348122

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201510579940.5A Active CN106528567B (en) 2015-09-11 2015-09-11 The update method and device of web crawlers cluster information

Country Status (1)

Country Link
CN (1) CN106528567B (en)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113965371B (en) * 2021-10-19 2023-08-29 北京天融信网络安全技术有限公司 Task processing method, device, terminal and storage medium in website monitoring process

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102298633A (en) * 2011-09-08 2011-12-28 厦门市美亚柏科信息股份有限公司 Method and system for investigating repeated data in distributed mass data
CN102932448A (en) * 2012-10-30 2013-02-13 工业和信息化部电信传输研究所 Distributed network crawler URL (uniform resource locator) duplicate removal system and method
CN103067521A (en) * 2013-01-08 2013-04-24 中国科学院声学研究所 Distributed-type nodes and distributed-type system in a crawler cluster
CN103258036A (en) * 2013-05-15 2013-08-21 广州一呼百应网络技术有限公司 Distributed real-time search engine based on p2p
CN103559083A (en) * 2013-10-11 2014-02-05 北京奇虎科技有限公司 Web crawl task scheduling method and task scheduler

Also Published As

Publication number Publication date
CN106528567A (en) 2017-03-22

Similar Documents

Publication Publication Date Title
CN109413109B (en) Heaven and earth integrated network oriented security state analysis method based on finite-state machine
CN107360162B (en) Network application protection method and device
CN102685224B (en) User behavior analysis method, related equipment and system
CN104363253B (en) Website security detection method and device
CN103189836A (en) Method for classification of objects in a graph data stream
CN104378389B (en) Website security detection method and device
CN104363251B (en) Website security detection method and device
CN105302815B (en) The filter method and device of the uniform resource position mark URL of webpage
CN104363252B (en) Website security detection method and device
CN103593413A (en) Meta-search engine personalizing method based on Agent
GB2445084B (en) Method and apparatus for clustered filtering in an rfid infrastructure
US10491606B2 (en) Method and apparatus for providing website authentication data for search engine
CN107967279A (en) The data-updating method and device of distributed data base
CN107438111A (en) Method, server and the system of method and the domain name agency of inquiry of the domain name
CN106528567B (en) The update method and device of web crawlers cluster information
CN105653580A (en) Feature information determination and judgment methods and devices as well as application method and system thereof
CN208940010U (en) A kind of intranet and extranet synchronization system
CN106067879B (en) The detection method and device of information
US20140137250A1 (en) System and method for detecting final distribution site and landing site of malicious code
CN103853833A (en) Information processing method and data processing equipment
CN102377826B (en) Method for optimal placement of unpopular resource indexes in peer-to-peer network
CN105530326A (en) Method and device for detecting IP address conflict of three-layer interface
CN106878240A (en) Zombie host recognition methods and device
CN102999558A (en) Processing search queries using a data structure
CN105989002A (en) Webpage data query method and device, and method and device for establishing webpage jump path database

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
CB02 Change of applicant information

Address after: 100083 No. 401, 4th Floor, Haitai Building, 229 North Fourth Ring Road, Haidian District, Beijing

Applicant after: Beijing Guoshuang Technology Co.,Ltd.

Address before: 100086 Cuigong Hotel, 76 Zhichun Road, Shuangyushu District, Haidian District, Beijing

Applicant before: Beijing Guoshuang Technology Co.,Ltd.

GR01 Patent grant