CN107798106A

CN107798106A - A URL deduplication method in a distributed crawler system

Info

Publication number: CN107798106A
Application number: CN201711047215.9A
Authority: CN
Inventors: 曾映方
Original assignee: Guangdong Civic Mdt Infotech Ltd
Current assignee: Guangdong Civic Mdt Infotech Ltd
Priority date: 2017-10-31
Filing date: 2017-10-31
Publication date: 2018-03-13
Anticipated expiration: 2037-10-31
Also published as: CN107798106B

Abstract

The invention discloses a URL deduplication method in a distributed crawler system. Hash values are placed in a ring space of size 2^m; each node handles one contiguous segment of the hash ring, and each node corresponds to one Bloom Filter structure. To deduplicate a URL, its hash is computed first to obtain the corresponding server node, which then judges from the contents of its Bloom Filter whether the URL already exists. By combining consistent hashing with Bloom Filters, the invention allows Bloom Filter nodes to be added dynamically as needed. This keeps the Bloom Filter false positive rate within a given range as the number of URLs keeps growing, while still taking full advantage of the Bloom Filter's efficiency. The method is suitable for building large-scale distributed web crawlers and supports efficient crawling of massive volumes of web content.

Description

A URL deduplication method in a distributed crawler system

Technical field

The present invention relates to the field of network technology, and specifically to a URL deduplication method in a distributed crawler system.

Background technology

A web crawler is a program that automatically fetches web content according to certain rules, and it is widely used across the Internet. A crawler downloads page content starting from specified URLs, extracts the URLs contained in it, and then continues downloading from those addresses. Because a newly extracted URL may already have been processed, continuing to process it causes repeated downloads and wastes bandwidth and computing resources. URL deduplication is widely used in network audit systems and search engine systems. In a distributed web crawler system that crawls in parallel, URL tasks must be assigned to multiple servers according to some strategy, and the partitioning strategy must be efficient and easy to implement. In a distributed environment, a URL extracted by one host may already have been processed by another host in the system, so the system needs a distributed URL deduplication mechanism. URL deduplication mainly has to address two issues: URL storage space and URL matching speed. URL storage space refers to the maximum number of distinct URLs that can be handled and the memory occupied by each URL. URL matching speed is measured by the time needed to judge whether a URL is a duplicate.

The Bloom Filter is an effective tool for URL deduplication. The main idea of a Bloom Filter deduplication scheme is as follows: the same URL is mapped by several different hash functions to different positions in one bit array, and the crawl state of the URL (whether it has already been collected) is determined from the state of those positions. The advantage of the Bloom Filter algorithm is that only the bit array needs to be kept in memory to determine a URL's crawl state; the URLs themselves need not be stored, so little storage is used and lookups are fast. However, when judging whether an element belongs to a set, a Bloom Filter may mistake an element that is not in the set for one that is. The drawback of the Bloom Filter algorithm is therefore that it is not exact: a certain error rate exists.
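
The bit-array idea described above can be illustrated with a minimal sketch. The class name, the salted SHA-256 hashing, and the sizes used here are assumptions made for illustration only; they are not taken from the invention.

```python
import hashlib


class BloomFilter:
    """Minimal Bloom filter: a bit array plus k hash functions."""

    def __init__(self, num_bits: int, num_hashes: int) -> None:
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray((num_bits + 7) // 8)  # all bits start at 0

    def _positions(self, item: str) -> list:
        # Derive k bit positions by salting SHA-256 with the hash index.
        # (Salted SHA-256 is an assumption of this sketch.)
        return [
            int.from_bytes(hashlib.sha256(f"{i}:{item}".encode()).digest()[:8], "big")
            % self.num_bits
            for i in range(self.num_hashes)
        ]

    def add(self, item: str) -> None:
        for pos in self._positions(item):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def __contains__(self, item: str) -> bool:
        # True only if every one of the k bits is set; may be a false positive.
        return all(self.bits[pos // 8] & (1 << (pos % 8)) for pos in self._positions(item))


seen = BloomFilter(num_bits=8192, num_hashes=4)
seen.add("http://example.com/a")
print("http://example.com/a" in seen)  # True: an added URL is always reported as seen
print(len(seen.bits))                  # 1024 bytes, independent of URL length
```

Only the bit array is kept in memory, so the footprint depends on the configured number of bits rather than on the length or number of the URLs.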

Some existing Bloom Filter based URL deduplication schemes have shortcomings in efficiency, scalability and performance. For example, the Internet Archive crawler uses a 32 KB Bloom Filter to store all URLs of each website. The Apoidea crawler uses an 8 KB Bloom Filter to store all URLs of each website, and decides which Bloom Filter a given URL is stored in according to the hash value of the website's domain name. In general, the organization of the Bloom Filters in these schemes is relatively fixed. For some large portal websites (such as Sina and Xinhuanet), the number of URLs under the domain name is so large that a single fixed-size Bloom Filter of limited length is often insufficient to store them, while for small websites an oversized Bloom Filter inevitably wastes storage space. In addition, information collection systems with high performance requirements often deploy a large number of web crawlers at the same time in a distributed network environment to crawl pages in parallel. Existing Bloom Filters and their improved variants are mostly ill-suited to URL deduplication in this application scenario.

Consistent hashing is a special hashing algorithm. With consistent hashing, a change in the number of slots of a hash table requires remapping only K/n keys on average, where K is the number of keys and n is the number of slots. In a traditional hash table, by contrast, adding or removing one slot requires remapping almost all keys. Consistent hashing maps each object to a point on a ring, and the system also maps the available node machines to positions on the same ring. To find the machine for an object, the object's position on the ring is computed with the consistent hashing algorithm, and the ring is then traversed from that position until a node machine is encountered; that machine is where the object should be stored. When a node machine is removed, all objects stored on it are moved to the next machine. When a machine is added at some point on the ring, the next machine after that point must move the objects that map ahead of the new node onto the new machine. The distribution of objects across node machines can be changed by adjusting the positions of the node machines.
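
The ring lookup described above can be sketched as follows. This is an illustrative sketch only: the ring size of 2^32, the use of SHA-256, and the names `HashRing`, `ring_hash` and the example URLs are assumptions, not details from the invention.

```python
import bisect
import hashlib

RING_BITS = 32  # ring size 2**32; an assumed value for illustration


def ring_hash(key: str) -> int:
    """Map a key to a point on the ring [0, 2**RING_BITS)."""
    digest = hashlib.sha256(key.encode()).digest()
    return int.from_bytes(digest[:8], "big") % (1 << RING_BITS)


class HashRing:
    def __init__(self, nodes=()):
        self._points = []   # sorted node positions on the ring
        self._owner = {}    # position -> node name
        for node in nodes:
            self.add_node(node)

    def add_node(self, node: str) -> None:
        point = ring_hash(node)
        bisect.insort(self._points, point)
        self._owner[point] = node

    def lookup(self, key: str) -> str:
        # Walk clockwise to the first node at or after the key's position,
        # wrapping around to the first node at the end of the ring.
        idx = bisect.bisect_left(self._points, ring_hash(key))
        return self._owner[self._points[idx % len(self._points)]]


ring = HashRing(["A", "B", "C"])
keys = [f"http://example.com/page/{i}" for i in range(1000)]
before = {k: ring.lookup(k) for k in keys}
ring.add_node("D")
after = {k: ring.lookup(k) for k in keys}
moved = [k for k in keys if before[k] != after[k]]
# Every key that changed owner now belongs to the new node; all others stay put.
print(all(after[k] == "D" for k in moved))  # True
```

Adding node D only takes keys away from the one node that follows D on the ring, which is the property that makes remapping cheap compared with a traditional hash table.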

Summary of the invention

In view of the problems and deficiencies in the prior art, the object of the invention is to provide a URL deduplication method in a distributed crawler system. The method ensures that, as the number of URLs keeps growing, the false positive rate stays within a given threshold while the efficiency of the Bloom Filter is still fully used, making it better suited to building large-scale distributed web crawler systems.

To achieve the above object, the present invention provides the following technical solution:

A URL deduplication method in a distributed crawler system, comprising the following steps:

S1, treat the server cluster as a unified resource pool and place hash values in a hash ring space of size 2^m; each service node is also placed on the hash ring as an object, each service node corresponds to one Bloom Filter structure, and each service node handles the requests in its corresponding range;

S2, each node initializes its Bloom Filter structure, that is, initializes an array of n bits with every bit initially 0;

S3, compute the hash H of a newly obtained URL;

S4, obtain the corresponding service node k from the position where H falls on the hash ring;

S5, the corresponding server k applies K hash functions to the URL to obtain K hash values H[0], H[1], ..., H[k-1];

S6, look up the bit map in the Bloom Filter according to the K hash values and judge whether the corresponding bits are all 1; if all are 1, the URL is considered a duplicate and the method proceeds to step S7; otherwise it proceeds to step S8;

S7, discard the duplicate URL and return to step S3;

S8, put the URL into the crawler's pending queue;

S9, set all of the positions H[0], H[1], ..., H[k-1] in the Bloom Filter of server node k to 1;

S10, record an insertion log entry containing H, H[0], H[1], ..., H[k-1], and increase the insertion record count by 1;

S11, judge whether the utilization of the Bloom Filter has reached the threshold; if not, return to step S3; otherwise proceed to step S12;

S12, add a node: a new node K+1 is generated and the Bloom Filter content of the corresponding node is migrated, that is, the node's write log is replayed, and for each record whose H value belongs to node K+1, the positions H[0], H[1], ..., H[k-1] in the Bloom Filter of node K+1 are all set to 1;

S13, the corresponding node k performs an operation similar to step S12 to generate a new Bloom Filter, replaces the original Bloom Filter once it is complete, and returns to step S3.
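
Steps S3 to S10 can be sketched in a single process as follows. This is a hedged illustration, not the claimed implementation: the three nodes are simulated in memory, and the names `ServiceNode`, `submit` and `h`, the SHA-256 based hashing, and the sizes chosen are all assumptions. The threshold check and node addition of steps S11 to S13 are left out of this sketch.

```python
import bisect
import hashlib
from collections import deque

RING_BITS = 32        # the ring has 2**m positions; m = 32 is assumed here
NUM_BITS = 1 << 16    # Bloom Filter length n in bits (assumed)
NUM_HASHES = 4        # number of hash functions K (assumed)


def h(data: str, salt: str = "") -> int:
    return int.from_bytes(hashlib.sha256((salt + data).encode()).digest()[:8], "big")


class ServiceNode:
    def __init__(self, name: str) -> None:
        self.name = name
        self.bits = 0      # S2: Bloom Filter as an integer bitset, all zeros
        self.log = []      # S10: insertion log of (H, positions)
        self.inserted = 0  # S10: insertion record count

    def check_and_insert(self, ring_pos: int, url: str) -> bool:
        """Return True if the URL is new, False if it is judged a duplicate."""
        positions = [h(url, f"{i}:") % NUM_BITS for i in range(NUM_HASHES)]  # S5
        if all(self.bits >> p & 1 for p in positions):                       # S6
            return False                                                     # S7
        for p in positions:                                                  # S9
            self.bits |= 1 << p
        self.log.append((ring_pos, positions))                               # S10
        self.inserted += 1
        return True


# S1: place the service nodes on the ring.
nodes = {h(name) % (1 << RING_BITS): ServiceNode(name) for name in ("A", "B", "C")}
points = sorted(nodes)
pending = deque()  # the crawler's pending queue


def submit(url: str) -> bool:
    ring_pos = h(url) % (1 << RING_BITS)                      # S3
    idx = bisect.bisect_left(points, ring_pos) % len(points)  # S4
    is_new = nodes[points[idx]].check_and_insert(ring_pos, url)
    if is_new:
        pending.append(url)                                   # S8
    return is_new


print(submit("http://example.com/"))  # True: first time seen
print(submit("http://example.com/"))  # False: judged a duplicate
print(len(pending))                   # 1
```

Keeping the ring position H together with the bit positions in each log record is what later allows a node's Bloom Filter to be rebuilt for a sub-segment of the ring without access to the original URLs.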

Specifically, in the Bloom Filter structure, the false positive probability of the Bloom Filter is

$$f = \left(1 - e^{-kc/m}\right)^{k}$$

where k is the number of hash functions, m is the total number of bits, and c is the number of inserted elements. Given f, k and m, the maximum number of elements c that may be inserted is:

$$c = -\frac{m}{k}\,\ln\left(1 - f^{1/k}\right)$$
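
As a worked illustration, and assuming the standard Bloom Filter approximation f = (1 - e^(-kc/m))^k with k, m and c as defined above, the maximum insertion count follows by solving for c, which gives c = -(m/k) ln(1 - f^(1/k)). The function names and the example values of m, k and f below are assumptions chosen for illustration.

```python
import math


def false_positive_rate(k: int, m: int, c: int) -> float:
    """Approximate false positive rate after inserting c elements."""
    return (1.0 - math.exp(-k * c / m)) ** k


def max_insertions(f: float, k: int, m: int) -> int:
    """Largest c whose approximate false positive rate stays at or below f."""
    return math.floor(-(m / k) * math.log(1.0 - f ** (1.0 / k)))


m = 8 * 1024 * 1024   # bits in the filter (assumed example size: 1 MiB)
k = 7                 # number of hash functions (assumed)
f = 0.01              # target false positive rate (assumed)

c = max_insertions(f, k, m)
print(c, false_positive_rate(k, m, c))
```

A node can compare its insertion record count against a fraction of this c to decide when a new service node is needed.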

Compared with the prior art, the beneficial effects of the invention are as follows:

The invention combines consistent hashing with Bloom Filters. Unlike schemes with Bloom Filters fixed in advance, it allows Bloom Filter nodes to be added dynamically as needed. This keeps the Bloom Filter false positive rate within a given range as the number of URLs keeps growing, while still taking full advantage of the Bloom Filter's efficiency. The method is suitable for building large-scale distributed web crawlers and supports efficient crawling of massive volumes of web content.

Brief description of the drawings

Fig. 1 is a flow chart of the URL deduplication method provided by the invention.

Fig. 2 is a schematic diagram of the Bloom Filter structure obtained after initialization.

Fig. 3 is a schematic diagram of the Bloom Filter structure when the hash H computed for a URL falls in segment AB of the hash ring and is handled by service node B.

Fig. 4 is a flow diagram of service node B handling a URL.

Fig. 5 is a schematic diagram of the Bloom Filter structure when a service node is added.

Embodiment

The technical solutions in the embodiments of the invention are described clearly and completely below with reference to the accompanying drawings. Obviously, the described embodiments are only some, not all, of the embodiments of the invention. All other embodiments obtained by those of ordinary skill in the art on the basis of the embodiments of the invention without creative effort fall within the scope of protection of the invention.

Referring to Fig. 1, the present invention provides a technical solution:

The technical solution of the invention is described in further detail below with reference to the accompanying drawings and specific embodiments.

As shown in Fig. 1, the main steps of the URL deduplication method in a distributed crawler system are initializing the consistent hash ring, deduplicating URLs, and adding service nodes. Service nodes are assigned to positions on the hash ring, and each service node is responsible for one segment of hash values. For a given URL, the hash H is computed first and falls at some position on the hash ring; the corresponding service node then judges whether the URL is a duplicate. A duplicate is discarded; otherwise the URL is inserted into the Bloom Filter. When the insertion record count on a service node reaches the threshold, a new service node is added to share the load.

1. According to the initially configured number of service nodes N, each service node corresponds to one Bloom Filter structure. The Bloom Filter is initialized, that is, a bit array of size n is allocated in memory and all bits are set to 0. As shown in Fig. 1, N=3 here; the large circle in the figure is the hash ring, whose range is 0 to 2^m, and A, B and C are the positions on the hash ring to which the service nodes hash. Node A is responsible for segment CA, node B for segment AB, and node C for segment BC. (Only three service nodes are used here for ease of illustration; in practice the initial number of service nodes can be set as needed.)

2. Compute the hash of the given URL so that it falls on the hash ring, and find the corresponding service node. As shown in Fig. 3, the hash H computed for the URL falls in segment AB of the hash ring and is handled by service node B.

3. Service node B handles the URL: it applies k hash functions to the URL to compute H[0], H[1], ..., H[k-1] and judges whether the corresponding bits in the Bloom Filter structure are all 1. If they are all 1, the URL is reported as a duplicate and discarded; otherwise it is reported as non-duplicate, the URL is added to the crawler task queue, and its bits are set to 1 in the Bloom Filter. In addition, an insertion log entry is recorded and the insertion count is increased by 1.

4. Add a service node. When the insertion record count exceeds the threshold (the threshold is set to 0.9*c), a service node must be added. The position of the new node is first calculated and determined, the corresponding Bloom Filter bit map information is then imported, and the service node is then added to the hash ring. As shown in Fig. 5, because the insertion records of service node B have reached the threshold, a new node D is needed to share the load. After node D is inserted, node D handles requests in segment AD and node B handles requests in segment DB. Before it starts serving, service node D must load the existing bit map, that is, re-import the write log of service node B: for every record whose H value belongs to segment AD, the corresponding positions H[0], H[1], ..., H[k-1] are set to 1. Node B can re-import the write records in the background: for every record whose H value belongs to segment DB, the corresponding positions H[0], H[1], ..., H[k-1] are set to 1, and after the import completes the new Bloom Filter replaces the previous one. This completes the operation of adding a service node.
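
The log replay used when adding node D can be sketched as follows. This is an illustration under stated assumptions: the ring positions of A, D and B, the log records, and the helper names `in_segment` and `replay` are hypothetical, and treating each segment as open at its start and closed at its end is a convention chosen for this sketch.

```python
def in_segment(h: int, start: int, end: int) -> bool:
    """True if h lies in the ring segment (start, end], allowing wrap-around."""
    if start < end:
        return start < h <= end
    return h > start or h <= end


def replay(log, start: int, end: int) -> int:
    """Rebuild a Bloom Filter bitset from the log records whose H falls in (start, end]."""
    bits = 0
    for ring_pos, positions in log:
        if in_segment(ring_pos, start, end):
            for p in positions:
                bits |= 1 << p
    return bits


# Hypothetical ring positions for nodes A, D (new) and B, with A < D < B.
A, D, B = 100, 200, 300

# Node B's insertion log: each record is (H, [H[0], ..., H[k-1]]).
log_b = [
    (120, [3, 17, 42]),   # H in segment AD: migrates to D
    (250, [5, 17, 60]),   # H in segment DB: stays with B
    (180, [8, 9, 42]),    # H in segment AD: migrates to D
]

bits_d = replay(log_b, A, D)       # D loads this before it starts serving
new_bits_b = replay(log_b, D, B)   # B builds this in the background, then swaps it in

print(sorted(p for p in range(64) if bits_d >> p & 1))      # [3, 8, 9, 17, 42]
print(sorted(p for p in range(64) if new_bits_b >> p & 1))  # [5, 17, 60]
```

Because node B's rebuilt filter no longer carries the bits of records that moved to D, its bit density drops, which is how the split brings the false positive rate back down.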

Claims

1. A URL deduplication method in a distributed crawler system, characterized by comprising the following steps:

S1, treat the server cluster as a unified resource pool and place hash values in a hash ring space of size 2^m; each service node is also placed on the hash ring as an object, each service node corresponds to one Bloom Filter structure, and each service node handles the requests in its corresponding range;

S2, each node initializes its Bloom Filter structure, that is, initializes an array of n bits with every bit initially 0;

S3, compute the hash H of a newly obtained URL;

S5, the corresponding server k applies K hash functions to the URL to obtain K hash values H[0], H[1], ..., H[k-1];

S6, look up the bit map in the Bloom Filter according to the K hash values and judge whether the corresponding bits are all 1; if all are 1, the URL is considered a duplicate and the method proceeds to step S7; otherwise it proceeds to step S8;

S7, discard the duplicate URL and return to step S3;

S8, put the URL into the crawler's pending queue;

S9, set all of the positions H[0], H[1], ..., H[k-1] in the Bloom Filter of server node k to 1;

S11, judge whether the utilization of the Bloom Filter has reached the threshold; if not, return to step S3; otherwise proceed to step S12;

S12, add a node: a new node K+1 is generated and the Bloom Filter content of the corresponding node is migrated, that is, the node's write log is replayed, and for each record whose H value belongs to node K+1, the positions H[0], H[1], ..., H[k-1] in the Bloom Filter of node K+1 are all set to 1;

S13, the corresponding node k performs an operation similar to step S12 to generate a new Bloom Filter, replaces the original Bloom Filter once it is complete, and returns to step S3.

2. The URL deduplication method in a distributed crawler system according to claim 1, characterized in that, in the Bloom Filter structure, the false positive probability of the Bloom Filter is

$$f = \left(1 - e^{-kc/m}\right)^{k}$$

where k is the number of hash functions, m is the total number of bits, and c is the number of inserted elements. Given f, k and m, the maximum number of elements c that may be inserted is:

$$c = -\frac{m}{k}\,\ln\left(1 - f^{1/k}\right)$$