CN106528567B - Method and device for updating web crawler cluster information
- Publication number: CN106528567B
- Application number: CN201510579940.5A
- Authority: CN (China)
- Prior art keywords: target, link, web crawler, local, broadcast
- Legal status: Active (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Classifications
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/90—Details of database functions independent of the retrieved data types
- G06F16/95—Retrieval from the web
- G06F16/951—Indexing; Web crawling techniques
Abstract
This application discloses a method and device for updating web crawler cluster information. Each web crawler in the web crawler cluster is equipped with a local detector. The method comprises: a target local detector queries, according to a message sent by its corresponding web crawler, whether a target crawl link exists in the target local detector, where the message carries the target crawl link; when the query finds that the target crawl link does not exist, the target local detector saves the target crawl link and sends a broadcast carrying the target crawl link to the other local detectors, so that the other local detectors update their crawled links according to the broadcast. The application addresses the technical problem in the related art that web crawlers have relatively low crawling efficiency.
Description
Technical field
This application relates to the field of internet crawlers, and in particular to a method and device for updating web crawler cluster information.
Background
When crawling websites, a web crawler cluster needs to filter out duplicate links so that the same page is not crawled repeatedly. While a web crawler crawls pages, the links it has crawled are stored in a detector used to filter duplicate pages. So that every crawler in the cluster holds a detector that is as nearly identical as possible at all times, and duplicate pages are not crawled again, the detectors need to be updated synchronously.
Existing schemes deploy a single unified detector in the cluster, and all web crawlers access that one detector to exclude duplicate pages. With this scheme, however, all web crawlers in the cluster compete for the same detector resource: every time a web crawler crawls a page, the shared detector has to check whether the link is a duplicate, which makes the crawling efficiency of the web crawlers relatively low.
No effective solution to the above problem has yet been proposed.
Summary of the invention
The embodiments of the present application provide a method and device for updating web crawler cluster information, to at least solve the technical problem in the related art that web crawlers have relatively low crawling efficiency.
According to one aspect of the embodiments of the present application, a method for updating web crawler cluster information is provided. Each web crawler in the web crawler cluster is equipped with a local detector. A target local detector queries, according to a message sent by its corresponding web crawler, whether a target crawl link exists in the target local detector, where the message carries the target crawl link; when the query finds that the target crawl link does not exist, the target local detector saves the target crawl link and sends a broadcast carrying the target crawl link to the other local detectors, so that the other local detectors update their crawled links according to the broadcast.
According to another aspect of the embodiments of the present application, a device for updating web crawler cluster information is also provided. Each web crawler in the web crawler cluster is equipped with a local detector. The device includes: a query unit, configured to query, according to a message sent by the web crawler corresponding to a target local detector, whether a target crawl link exists in the target local detector, where the message carries the target crawl link; and a broadcast unit, configured to, when the query finds that the target crawl link does not exist, save the target crawl link and send a broadcast carrying the target crawl link to the other local detectors, so that the other local detectors update their crawled links according to the broadcast.
In the embodiments of the present application, a target local detector queries, according to a message sent by its corresponding web crawler, whether a target crawl link exists in the target local detector, where the message carries the target crawl link; when the query finds that the target crawl link does not exist, the target local detector saves it and sends a broadcast carrying it to the other local detectors, so that they update their crawled links according to the broadcast. In this way, each web crawler filters duplicate target crawl links through its own local detector, which improves crawling efficiency. At the same time, each local detector both receives, by broadcast, information about links that have already been crawled and sends such information by broadcast, so the local detectors in the cluster hold consistent information and different crawlers do not crawl the same link twice. When multiple crawlers run crawling tasks simultaneously, both relatively high crawling efficiency and relatively high accuracy can be ensured, which solves the technical problem in the related art that web crawlers have relatively low crawling efficiency.
Brief description of the drawings
The drawings described here are provided for a further understanding of the present application and form a part of it. The illustrative embodiments of the application and their descriptions are used to explain the application and do not unduly limit it. In the drawings:
Fig. 1 is a flowchart of a method for updating web crawler cluster information according to an embodiment of the present application;
Fig. 2 is a schematic diagram of an optional web crawler cluster structure according to an embodiment of the present application;
Fig. 3 is a schematic diagram of a device for updating web crawler cluster information according to an embodiment of the present application.
Detailed description of the embodiments
To help those skilled in the art better understand the solution of the present application, the technical solutions in the embodiments are described clearly and completely below with reference to the accompanying drawings. The described embodiments are only some of the embodiments of the application, not all of them. All other embodiments obtained by a person of ordinary skill in the art based on these embodiments without creative effort shall fall within the scope of protection of the present application.
It should be noted that the terms "first", "second", and the like in the description, the claims, and the above drawings are used to distinguish similar objects, not to describe a particular order or sequence. Data used in this way are interchangeable where appropriate, so that the embodiments described here can be implemented in orders other than those illustrated or described. In addition, the terms "include" and "have", and any variants of them, are intended to cover non-exclusive inclusion: for example, a process, method, system, product, or device that comprises a series of steps or units is not necessarily limited to the steps or units expressly listed, but may include other steps or units that are not expressly listed or that are inherent to the process, method, product, or device.
According to an embodiment of the present application, a method embodiment of a method for updating web crawler cluster information is provided. It should be noted that the steps shown in the flowcharts of the drawings may be executed in a computer system, for example as a set of computer-executable instructions, and that although a logical order is shown in the flowcharts, in some cases the steps shown or described may be executed in a different order.
Fig. 1 is a flowchart of the method for updating web crawler cluster information according to an embodiment of the present application. Each web crawler in the web crawler cluster is equipped with a local detector. As shown in Fig. 1, the method comprises the following steps:
Step S102: the target local detector queries, according to a message sent by its corresponding web crawler, whether a target crawl link exists in the target local detector, where the message carries the target crawl link.
Step S104: when the query finds that the target crawl link does not exist, the target local detector saves the target crawl link and sends a broadcast carrying the target crawl link to the other local detectors, so that the other local detectors update their crawled links according to the broadcast.
Each web crawler in the web crawler cluster is equipped with a local detector, and the target local detector may be the local detector of any web crawler in the cluster. When the target local detector finds that a link has not been crawled, its corresponding web crawler can crawl the link, and the target local detector broadcasts a message stating that the link has been crawled. The other local detectors that receive the broadcast store the link, so that when their own web crawlers crawl, the link is filtered out and the same link is not crawled twice. Because the local detector of every web crawler in the cluster can receive the broadcast, the local detectors can update their locally stored information synchronously. In this embodiment, broadcasting lets multiple local detectors update their information synchronously, so whichever local detector the cluster uses to filter duplicate links, duplicates are filtered out accurately. Because each web crawler has its own local detector and uses it to check for duplicate links, the crawlers do not contend for the resources of a single shared detector, which improves the efficiency of duplicate filtering and therefore the crawling efficiency of the web crawlers. Because links that have been crawled are stored in every local detector in the cluster, each crawler filters duplicates accurately through its own local detector. Filtering accuracy thus improves along with crawling efficiency, achieving accurate and efficient crawling.
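Steps S102 and S104 can be illustrated with a minimal Python sketch. It is only one possible reading of the method, not the patented implementation: the class name `LocalDetector`, the method names, and the direct peer-to-peer delivery of the broadcast are assumptions made for illustration.

```python
from typing import List, Set


class LocalDetector:
    """Per-crawler store of crawled links (hypothetical name, for illustration)."""

    def __init__(self, name: str) -> None:
        self.name = name
        self.seen: Set[str] = set()
        self.peers: List["LocalDetector"] = []  # the other local detectors

    def query_and_update(self, target_link: str) -> bool:
        """Return True if the target crawl link was not stored here yet."""
        if target_link in self.seen:   # S102: the link already exists locally
            return False
        self.seen.add(target_link)     # S104: save the target crawl link
        for peer in self.peers:        # S104: broadcast to the other detectors
            peer.on_broadcast(target_link)
        return True

    def on_broadcast(self, link: str) -> None:
        """Update the locally stored crawled links from a received broadcast."""
        self.seen.add(link)


a, b, c = LocalDetector("a"), LocalDetector("b"), LocalDetector("c")
a.peers, b.peers, c.peers = [b, c], [a, c], [a, b]

first = a.query_and_update("www.abcdefg.com")   # new link: saved and broadcast
second = b.query_and_update("www.abcdefg.com")  # known to b through the broadcast
```

Each lookup touches only the detector that belongs to the asking crawler, which is the property the paragraph above relies on to avoid contention for a single shared detector.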
Optionally, after the target local detector queries, according to the message sent by its corresponding web crawler, whether the target crawl link exists in the target local detector, the method further comprises: when the query finds that the target crawl link does not exist, the target local detector sends an instruction permitting crawling to its corresponding web crawler, so that the web crawler crawls the target crawl link; when the query finds that the target crawl link exists, the target local detector sends an instruction to abandon crawling to its corresponding web crawler, so that the web crawler abandons crawling the target crawl link.
The target local detector is queried to determine whether it stores the target crawl link. If the link is found, it has already been crawled and need not be crawled again, so the corresponding web crawler is told not to crawl it. If the link is not found, it has not been crawled and may be crawled, so the corresponding web crawler is told to crawl it. Because the query takes place before crawling, the same target crawl link is not crawled repeatedly. Because the crawled-link information in every local detector in the cluster is synchronized, each web crawler can ask its own local detector whether the target crawl link has been crawled. This avoids contention for a single detector and improves query efficiency, and with it the overall crawling efficiency.
As shown in Fig. 2, web crawler A is to crawl the target crawl link www.abcdefg.com. Web crawler A looks up the target link in local detector a. If the target link is not found in local detector a, web crawler A crawls the target crawl link www.abcdefg.com. If the target link is found in local detector a, it is determined that the link has already been crawled, and web crawler A abandons crawling it, which avoids crawling the same link twice.
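The permit or abandon decision described above might look like the following Python sketch. The names `Instruction` and `handle_message` are hypothetical, and the broadcast to other detectors is deliberately left out so that only the instruction returned to the crawler is shown.

```python
from enum import Enum
from typing import Set


class Instruction(Enum):
    ALLOW = "allow"      # the crawler may crawl the target crawl link
    ABANDON = "abandon"  # the crawler should skip the target crawl link


def handle_message(stored_links: Set[str], target_link: str) -> Instruction:
    """Answer a crawler's message that carries a target crawl link."""
    if target_link in stored_links:
        # The link is already stored, so it has been crawled before.
        return Instruction.ABANDON
    # The link is new: record it locally and let the crawler proceed.
    stored_links.add(target_link)
    return Instruction.ALLOW


detector_a: Set[str] = set()  # stands in for local detector a
first_reply = handle_message(detector_a, "www.abcdefg.com")
second_reply = handle_message(detector_a, "www.abcdefg.com")
```

In this sketch the first message for www.abcdefg.com is answered with `ALLOW` and the second with `ABANDON`, matching the two branches of the Fig. 2 example.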
Specifically, the web crawler cluster further includes a broadcast module. The target local detector sending a broadcast carrying the target crawl link to the other local detectors comprises: the target local detector sends crawl information carrying the target crawl link to the broadcast module, so that the broadcast module generates the broadcast according to the crawl information and sends the broadcast to the other local detectors that subscribe to it. A local detector sends broadcasts through the broadcast module in the cluster and also receives broadcasts sent by the broadcast module, so that all local detectors in the cluster are updated synchronously. The other web crawlers in the cluster can receive the broadcast and record the target crawl link, so that the local detector of each web crawler stores the links that have been crawled.
For example, as shown in Fig. 2, the web crawler cluster includes web crawler A, web crawler B, web crawler C … web crawler N, whose corresponding local detectors are local detector a, local detector b, local detector c … local detector n. The cluster also includes broadcast module X, and every web crawler that subscribes to the broadcast can listen to the broadcasts sent by broadcast module X. Web crawler A is to crawl the target crawl link www.abcdefg.com and looks it up in local detector a. The link is not found in local detector a, so web crawler A crawls www.abcdefg.com. Local detector a sends broadcast module X the information that the target crawl link www.abcdefg.com has been crawled, and broadcast module X generates a broadcast carrying www.abcdefg.com. Local detector b, local detector c … local detector n, which subscribe to the broadcast, store the www.abcdefg.com carried by the broadcast locally. When web crawler B needs to crawl the target crawl link www.abcdefg.com, local detector b finds the link, so web crawler B does not crawl www.abcdefg.com. After web crawler B crawls another link that is not stored in local detector b, crawl information is sent in the same way as for local detector a, and the details are not repeated here.
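The Fig. 2 walkthrough can be replayed with a small in-process publish/subscribe sketch in Python. It assumes the broadcast module is a plain object that relays each link to every subscribed detector except the sender; the class and method names, and the second link www.example.org, are hypothetical and do not come from the source.

```python
from typing import List, Set


class BroadcastModule:
    """Stand-in for broadcast module X: relays crawl information to subscribers."""

    def __init__(self) -> None:
        self.subscribers: List["LocalDetector"] = []

    def subscribe(self, detector: "LocalDetector") -> None:
        self.subscribers.append(detector)

    def publish(self, sender: "LocalDetector", link: str) -> None:
        # Generate the broadcast and deliver it to every other subscriber.
        for detector in self.subscribers:
            if detector is not sender:
                detector.receive(link)


class LocalDetector:
    def __init__(self, name: str, module: BroadcastModule) -> None:
        self.name = name
        self.links: Set[str] = set()
        self.module = module
        module.subscribe(self)

    def may_crawl(self, link: str) -> bool:
        """Return True if the corresponding crawler should crawl the link."""
        if link in self.links:
            return False
        self.links.add(link)
        self.module.publish(self, link)  # send crawl information to module X
        return True

    def receive(self, link: str) -> None:
        self.links.add(link)


module_x = BroadcastModule()
a = LocalDetector("a", module_x)
b = LocalDetector("b", module_x)
n = LocalDetector("n", module_x)

a_result = a.may_crawl("www.abcdefg.com")   # crawler A: not stored in detector a
b_result = b.may_crawl("www.abcdefg.com")   # crawler B: stored via the broadcast
b_other = b.may_crawl("www.example.org")    # crawler B: a different, unseen link
```

After these three calls, detectors a, b, and n all hold both links, which is the synchronized state the example describes.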
Optionally, the target local detector sending a broadcast carrying the target crawl link to the other local detectors, so that the other local detectors update their crawled links according to the broadcast, comprises: the local detector sends the broadcast carrying the target crawl link to the other local detectors, so that the other local detectors receive the broadcast and save the target crawl link it carries.
Because each crawler in the web crawler cluster corresponds to one local detector, the cluster can be expanded by adding a crawler together with its local detector, or changed by removing a crawler together with its local detector. When a crawler and its local detector are added, the new local detector only needs to subscribe to the broadcasts of the broadcast module to receive the update information that the module sends, which keeps the update information of the local detectors synchronized. The information stored in the local detectors therefore stays consistent: neither adding nor removing a local detector affects how the remaining local detectors filter duplicate links or how accurately the web crawlers crawl links. Because each crawler corresponds to one local detector, adding or removing a crawler and its local detector as a pair does not reduce the crawling efficiency of the other crawlers.
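Membership changes of this kind can be sketched in Python as subscribing and unsubscribing handlers on a broadcast module. The `BroadcastModule` class, its methods, and the link www.example.org are assumptions for illustration; for brevity each detector is reduced to a plain set, and the broadcast is delivered to every current subscriber. The source does not say how links crawled before a detector joined reach it, so the sketch leaves that out.

```python
from typing import Callable, Dict, Set

Handler = Callable[[str], None]


class BroadcastModule:
    """Hypothetical broadcast module that tracks its current subscribers."""

    def __init__(self) -> None:
        self._handlers: Dict[str, Handler] = {}

    def subscribe(self, name: str, handler: Handler) -> None:
        self._handlers[name] = handler

    def unsubscribe(self, name: str) -> None:
        self._handlers.pop(name, None)

    def broadcast(self, link: str) -> None:
        # Deliver the crawled link to every detector subscribed right now.
        for handler in list(self._handlers.values()):
            handler(link)


module = BroadcastModule()
store_a: Set[str] = set()
store_b: Set[str] = set()
module.subscribe("a", store_a.add)
module.subscribe("b", store_b.add)
module.broadcast("www.abcdefg.com")

store_c: Set[str] = set()
module.subscribe("c", store_c.add)  # expand the cluster with a new detector
module.unsubscribe("b")             # remove a detector from the cluster
module.broadcast("www.example.org")
```

Detector a is unaffected by either change and receives both broadcasts, while the new detector c starts receiving updates as soon as it subscribes.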
Through the above embodiment, each web crawler filters duplicate target crawl links through its own local detector, which improves crawling efficiency. At the same time, each local detector both receives, by broadcast, information about links that have already been crawled and sends such information by broadcast, so the local detectors in the cluster hold consistent information and different crawlers do not crawl the same link twice. When multiple crawlers run crawling tasks simultaneously, both relatively high crawling efficiency and relatively high accuracy can be ensured.
According to an embodiment of the present application, a device embodiment of a device for updating web crawler cluster information is also provided. Each web crawler in the web crawler cluster is equipped with a local detector. The device can execute the method for updating web crawler cluster information described above, and that method can also be executed by this device.
As shown in Fig. 3, the device for updating web crawler cluster information includes: a query unit 10, configured to query, according to a message sent by the web crawler corresponding to the target local detector, whether a target crawl link exists in the target local detector, where the message carries the target crawl link; and a broadcast unit 30, configured to, when the query finds that the target crawl link does not exist, save the target crawl link and send a broadcast carrying the target crawl link to the other local detectors, so that the other local detectors update their crawled links according to the broadcast.
Each web crawler in the web crawler cluster is equipped with a local detector. When a local detector determines that a link has not been crawled, the link can be crawled, and a message stating that the link has been crawled is sent by broadcast. The local detectors that receive the broadcast store the link, so that when their web crawlers crawl, the link is filtered out and the same link is not crawled twice. Because the local detector of every web crawler in the cluster can receive the broadcast, the local detectors can update their locally stored information synchronously. In this embodiment, broadcasting lets multiple local detectors update their information synchronously, so whichever local detector a web crawler uses to filter duplicate links, duplicates are filtered out accurately. Because each web crawler has its own local detector and uses it to check for duplicate links, the crawlers do not contend for the resources of a single shared detector, which improves the efficiency of duplicate filtering and therefore crawling efficiency. Because links that have been crawled are stored in every local detector in the cluster, each crawler filters duplicates accurately through its own local detector, so filtering accuracy improves along with crawling efficiency, achieving accurate and efficient crawling.
Optionally, the device further includes: a first sending unit, configured to, after the target local detector queries, according to the message sent by its corresponding web crawler, whether the target crawl link exists in the target local detector, send an instruction permitting crawling to the web crawler corresponding to the target local detector when the query finds that the target crawl link does not exist, so that the web crawler crawls the target crawl link; and a second sending unit, configured to send an instruction to abandon crawling to the web crawler corresponding to the target local detector when the query finds that the target crawl link exists, so that the web crawler abandons crawling the target crawl link.
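The division of the device into units might be modelled as in the following Python sketch. It assumes one method per unit on a single hypothetical `UpdatingDevice` class; as the description itself notes later, the division into units is a logical one, so this is only one of many possible arrangements. The instruction strings and the callback-based broadcast are likewise assumptions.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Set


@dataclass
class UpdatingDevice:
    """Hypothetical grouping of the query, broadcast, and sending units."""

    stored_links: Set[str] = field(default_factory=set)
    broadcast_sinks: List[Callable[[str], None]] = field(default_factory=list)
    sent_instructions: List[str] = field(default_factory=list)

    def query_unit(self, target_link: str) -> bool:
        """Report whether the target crawl link is already stored."""
        return target_link in self.stored_links

    def broadcast_unit(self, target_link: str) -> None:
        """Save the link, then broadcast it to the other local detectors."""
        self.stored_links.add(target_link)
        for sink in self.broadcast_sinks:
            sink(target_link)

    def first_sending_unit(self, target_link: str) -> None:
        self.sent_instructions.append(f"allow {target_link}")

    def second_sending_unit(self, target_link: str) -> None:
        self.sent_instructions.append(f"abandon {target_link}")

    def handle(self, target_link: str) -> None:
        """Process one message from the corresponding web crawler."""
        if self.query_unit(target_link):
            self.second_sending_unit(target_link)
        else:
            self.broadcast_unit(target_link)
            self.first_sending_unit(target_link)


other_detector: Set[str] = set()  # stands in for another local detector
device = UpdatingDevice(broadcast_sinks=[other_detector.add])
device.handle("www.abcdefg.com")
device.handle("www.abcdefg.com")
```

The first message results in a save, a broadcast, and an allow instruction; the second results only in an abandon instruction.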
Specifically, the web crawler cluster further includes a broadcast module, and the broadcast unit includes: a sending module, configured to send crawl information carrying the target crawl link to the broadcast module, so that the broadcast module generates the broadcast according to the crawl information and sends the broadcast to the other local detectors that subscribe to it.
A local detector sends broadcasts through the broadcast module in the cluster and also receives broadcasts sent by the broadcast module, so that all local detectors in the cluster are updated synchronously. The other web crawlers in the cluster can receive the broadcast and record the target crawl link, so that the local detector of each web crawler stores the links that have been crawled.
Optionally, the broadcast unit is further configured to send the broadcast carrying the target crawl link to the other local detectors, so that the other local detectors receive the broadcast and save the target crawl link it carries.
The serial numbers of the above embodiments of the present application are for description only and do not indicate that any embodiment is better or worse than another.
In the above embodiments of the present application, the description of each embodiment has its own emphasis. For parts not described in detail in one embodiment, see the related descriptions of the other embodiments.
In the several embodiments provided in this application, it should be understood that the disclosed technical content may be implemented in other ways. The device embodiments described above are merely illustrative. For example, the division into units may be a division by logical function, and other divisions are possible in an actual implementation: multiple units or components may be combined or integrated into another system, or some features may be omitted or not executed. In addition, the mutual coupling, direct coupling, or communication connection shown or discussed may be an indirect coupling or communication connection between units or modules through some interfaces, and may be electrical or take other forms.
Units described as separate components may or may not be physically separate, and components shown as units may or may not be physical units; they may be located in one place or distributed over multiple units. Some or all of the units may be selected according to actual needs to achieve the purpose of the solution of the embodiment.
In addition, the functional units in the embodiments of the present application may be integrated into one processing unit, each unit may exist physically on its own, or two or more units may be integrated into one unit. The integrated unit may be implemented in the form of hardware or in the form of a software functional unit.
If the integrated unit is implemented as a software functional unit and sold or used as an independent product, it may be stored in a computer-readable storage medium. On this understanding, the technical solution of the present application in essence, or the part of it that contributes to the prior art, or all or part of the technical solution, may be embodied as a software product. The computer software product is stored in a storage medium and includes instructions that cause a computer device (which may be a personal computer, a server, a network device, or the like) to execute all or some of the steps of the methods of the embodiments of the present application. The storage medium includes any medium that can store program code, such as a USB flash drive, read-only memory (ROM), random access memory (RAM), a removable hard disk, a magnetic disk, or an optical disc.
The above are only preferred embodiments of the present application. It should be noted that a person of ordinary skill in the art may make improvements and refinements without departing from the principle of the present application, and these improvements and refinements shall also be regarded as falling within the scope of protection of the present application.
Claims (8)
1. A method for updating web crawler cluster information, characterized in that each web crawler in the web crawler cluster is equipped with a local detector, the method comprising:
a target local detector querying, according to a message sent by its corresponding web crawler, whether a target crawl link exists in the target local detector, wherein the message carries the target crawl link;
when the query finds that the target crawl link does not exist, the target local detector saving the target crawl link and sending a broadcast carrying the target crawl link to other local detectors, so that the other local detectors update their crawled links according to the broadcast;
wherein links that have been crawled are stored in each local detector in the web crawler cluster.
2. The method according to claim 1, characterized in that, after the target local detector queries, according to the message sent by its corresponding web crawler, whether the target crawl link exists in the target local detector, the method further comprises:
when the query finds that the target crawl link does not exist, the target local detector sending to its corresponding web crawler an instruction allowing crawling, so that the web crawler crawls the target crawl link;
when the query finds that the target crawl link exists, the target local detector sending to its corresponding web crawler an instruction to abandon crawling, so that the web crawler abandons crawling the target crawl link.
3. The method according to claim 1, characterized in that the web crawler cluster further comprises a broadcast module, and the target local detector sending the broadcast carrying the target crawl link to the other local detectors comprises:
the target local detector sending crawl information carrying the target crawl link to the broadcast module, so that the broadcast module generates the broadcast according to the crawl information and sends the broadcast to the other local detectors that have subscribed to the broadcast.
4. The method according to claim 1, characterized in that the target local detector sending the broadcast carrying the target crawl link to the other local detectors, so that the other local detectors update their crawled links according to the broadcast, comprises:
the target local detector sending the broadcast carrying the target crawl link to the other local detectors, so that the other local detectors receive the broadcast and save the target crawl link carried in the broadcast.
5. A device for updating web crawler cluster information, characterized in that each web crawler in the web crawler cluster is provided with a local detector, the device comprising:
a query unit, configured to query, according to a message sent by the web crawler corresponding to a target local detector, whether a target crawl link exists in the target local detector, wherein the message carries the target crawl link;
a broadcast unit, configured to, when the query finds that the target crawl link does not exist, save the target crawl link and send a broadcast carrying the target crawl link to the other local detectors, so that the other local detectors update their crawled links according to the broadcast;
wherein each local detector in the web crawler cluster stores the links that have been crawled.
6. The device according to claim 5, characterized in that the device further comprises:
a first sending unit, configured to, after the target local detector queries, according to the message sent by its corresponding web crawler, whether the target crawl link exists in the target local detector, and when the query finds that the target crawl link does not exist, send an instruction allowing crawling to the web crawler corresponding to the target local detector, so that the web crawler crawls the target crawl link;
a second sending unit, configured to, when the query finds that the target crawl link exists, send an instruction to abandon crawling to the web crawler corresponding to the target local detector, so that the web crawler abandons crawling the target crawl link.
7. The device according to claim 5, characterized in that the web crawler cluster further comprises a broadcast module, and the broadcast unit comprises:
a sending module, configured to send crawl information carrying the target crawl link to the broadcast module, so that the broadcast module generates the broadcast according to the crawl information and sends the broadcast to the other local detectors that have subscribed to the broadcast.
8. The device according to claim 7, characterized in that the broadcast unit is further configured to send the broadcast carrying the target crawl link to the other local detectors, so that the other local detectors receive the broadcast and save the target crawl link carried in the broadcast.
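For illustration only, the flow recited in method claims 1 to 4 can be sketched in Python as follows. This is a minimal single-process sketch under stated assumptions, not the claimed implementation: the names `BroadcastModule`, `LocalDetector`, `subscribe`, `publish`, `check`, and `on_broadcast` are hypothetical, and an in-memory subscriber list stands in for whatever network transport a real cluster would use.

```python
from typing import List, Set


class BroadcastModule:
    """Hypothetical in-process stand-in for the cluster's broadcast module."""

    def __init__(self) -> None:
        self._subscribers: List["LocalDetector"] = []

    def subscribe(self, detector: "LocalDetector") -> None:
        # A local detector registers to receive future broadcasts.
        self._subscribers.append(detector)

    def publish(self, sender: "LocalDetector", link: str) -> None:
        # Forward the crawl link to every subscribed detector except the sender.
        for detector in self._subscribers:
            if detector is not sender:
                detector.on_broadcast(link)


class LocalDetector:
    """One detector per web crawler; stores the links already crawled."""

    def __init__(self, module: BroadcastModule) -> None:
        self._crawled: Set[str] = set()
        self._module = module
        module.subscribe(self)

    def check(self, target_link: str) -> bool:
        """Return True to allow crawling, False to abandon crawling."""
        if target_link in self._crawled:
            # Link already known locally: tell the crawler to abandon it.
            return False
        # Link is new: save it locally, then broadcast it to the other detectors.
        self._crawled.add(target_link)
        self._module.publish(self, target_link)
        return True

    def on_broadcast(self, link: str) -> None:
        # Update the local record with a link crawled elsewhere in the cluster.
        self._crawled.add(link)


if __name__ == "__main__":
    module = BroadcastModule()
    detector_a = LocalDetector(module)
    detector_b = LocalDetector(module)
    print(detector_a.check("http://example.com/page"))  # new link: crawling allowed
    print(detector_b.check("http://example.com/page"))  # learned via broadcast: abandoned
```

Because every detector holds its own copy of the crawled links, each crawler's query is answered locally without contacting a central store. The sketch does not address two detectors checking the same new link at the same instant; how a real deployment handles that case is outside what this example assumes.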
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201510579940.5A CN106528567B (en) | 2015-09-11 | 2015-09-11 | The update method and device of web crawlers cluster information |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201510579940.5A CN106528567B (en) | 2015-09-11 | 2015-09-11 | The update method and device of web crawlers cluster information |
Publications (2)
Publication Number | Publication Date |
---|---|
CN106528567A (en) | 2017-03-22 |
CN106528567B (en) | 2019-11-12 |
Family ID=58348122
Family Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201510579940.5A Active CN106528567B (en) | 2015-09-11 | 2015-09-11 | The update method and device of web crawlers cluster information |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN106528567B (en) |
Families Citing this family (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN113965371B (en) * | 2021-10-19 | 2023-08-29 | 北京天融信网络安全技术有限公司 | Task processing method, device, terminal and storage medium in website monitoring process |
Citations (5)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN102298633A (en) * | 2011-09-08 | 2011-12-28 | 厦门市美亚柏科信息股份有限公司 | Method and system for investigating repeated data in distributed mass data |
CN102932448A (en) * | 2012-10-30 | 2013-02-13 | 工业和信息化部电信传输研究所 | Distributed network crawler URL (uniform resource locator) duplicate removal system and method |
CN103067521A (en) * | 2013-01-08 | 2013-04-24 | 中国科学院声学研究所 | Distributed-type nodes and distributed-type system in a crawler cluster |
CN103258036A (en) * | 2013-05-15 | 2013-08-21 | 广州一呼百应网络技术有限公司 | Distributed real-time search engine based on p2p |
CN103559083A (en) * | 2013-10-11 | 2014-02-05 | 北京奇虎科技有限公司 | Web crawl task scheduling method and task scheduler |
Also Published As
Publication number | Publication date |
---|---|
CN106528567A (en) | 2017-03-22 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN109413109B (en) | Heaven and earth integrated network oriented security state analysis method based on finite-state machine | |
CN107360162B (en) | Network application protection method and device | |
CN102685224B (en) | User behavior analysis method, related equipment and system | |
CN104363253B (en) | Website security detection method and device | |
CN103189836A (en) | Method for classification of objects in a graph data stream | |
CN104378389B (en) | Website security detection method and device | |
CN104363251B (en) | Website security detection method and device | |
CN105302815B (en) | The filter method and device of the uniform resource position mark URL of webpage | |
CN104363252B (en) | Website security detection method and device | |
CN103593413A (en) | Meta-search engine personalizing method based on Agent | |
GB2445084B (en) | Method and apparatus for clustered filtering in an rfid infrastructure | |
US10491606B2 (en) | Method and apparatus for providing website authentication data for search engine | |
CN107967279A (en) | The data-updating method and device of distributed data base | |
CN107438111A (en) | Method, server and the system of method and the domain name agency of inquiry of the domain name | |
CN106528567B (en) | The update method and device of web crawlers cluster information | |
CN105653580A (en) | Feature information determination and judgment methods and devices as well as application method and system thereof | |
CN208940010U (en) | A kind of intranet and extranet synchronization system | |
CN106067879B (en) | The detection method and device of information | |
US20140137250A1 (en) | System and method for detecting final distribution site and landing site of malicious code | |
CN103853833A (en) | Information processing method and data processing equipment | |
CN102377826B (en) | Method for optimal placement of unpopular resource indexes in peer-to-peer network | |
CN105530326A (en) | Method and device for detecting IP address conflict of three-layer interface | |
CN106878240A (en) | Zombie host recognition methods and device | |
CN102999558A (en) | Processing search queries using a data structure | |
CN105989002A (en) | Webpage data query method and device, and method and device for establishing webpage jump path database |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
C06 | Publication | ||
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
CB02 | Change of applicant information | Address after: 100083 No. 401, 4th Floor, Haitai Building, 229 North Fourth Ring Road, Haidian District, Beijing. Applicant after: Beijing Guoshuang Technology Co.,Ltd. Address before: 100086 Cuigong Hotel, 76 Zhichun Road, Shuangyushu District, Haidian District, Beijing. Applicant before: Beijing Guoshuang Technology Co.,Ltd. |
GR01 | Patent grant | ||