A URL deduplication method for a distributed crawler system
Technical field
The present invention relates to the field of network technology, and specifically to a URL deduplication method for a distributed crawler system.
Background technology
A web crawler is a program that automatically fetches Web content according to certain rules, and it is widely used across the Internet. Starting from specified URLs, a crawler downloads page content, extracts the URLs it contains, and then continues downloading from those addresses. Because a newly extracted URL may already have been processed, handling it again causes repeated downloads and wastes bandwidth and computing resources. URL deduplication is widely used in network audit systems and search engine systems. A distributed web crawler system fetches pages in parallel, so URL tasks must be assigned to multiple servers under some partitioning strategy, and that strategy must be efficient and easy to implement. In a distributed environment, a URL extracted by one host may already have been processed by another host in the system, so the system needs a distributed URL deduplication mechanism. URL deduplication mainly has to address two issues: URL storage capacity and URL matching speed. Storage capacity is characterized by the maximum number of distinct URLs that can be handled and the memory occupied per URL. Matching speed is measured by the time needed to decide whether a given URL is a duplicate.
The Bloom Filter is an effective tool for URL deduplication. The main idea of a deduplication scheme based on the Bloom Filter algorithm is as follows: the same URL is mapped by several different hash functions to different positions in a single bit array, and the crawl state of the URL (whether it has already been collected) is identified from the state of those positions in the bit array. The advantage of the Bloom Filter algorithm is that only the bit array needs to be kept in memory to determine the crawl state of a URL; the URL itself does not need to be stored, so the memory footprint is small and lookups are fast. However, when the Bloom Filter algorithm decides whether an element belongs to a set, it may mistakenly report an element that is not in the set as belonging to it. The drawback of the Bloom Filter algorithm is therefore that it cannot be exact: a certain false positive error exists.
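The Bloom Filter behaviour described above can be illustrated with a minimal Python sketch. The class name, the use of SHA-256, and the double-hashing trick for deriving k positions are assumptions made for illustration only; the source does not specify which hash functions are used.

```python
import hashlib


class BloomFilter:
    """Minimal Bloom filter: a bit array of m bits probed by k hash functions."""

    def __init__(self, m: int, k: int) -> None:
        self.m = m
        self.k = k
        self.bits = bytearray((m + 7) // 8)  # every bit starts at 0

    def positions(self, url: str) -> list[int]:
        # Derive k bit positions from one SHA-256 digest via double hashing.
        # This is an illustrative choice, not the method named by the source.
        digest = hashlib.sha256(url.encode("utf-8")).digest()
        h1 = int.from_bytes(digest[:8], "big")
        h2 = int.from_bytes(digest[8:16], "big") | 1  # force an odd step
        return [(h1 + i * h2) % self.m for i in range(self.k)]

    def contains(self, url: str) -> bool:
        # "Probably seen" only if every probed bit is 1; may be a false positive.
        return all(self.bits[p // 8] & (1 << (p % 8)) for p in self.positions(url))

    def add(self, url: str) -> None:
        for p in self.positions(url):
            self.bits[p // 8] |= 1 << (p % 8)
```

A URL that was added is always reported as present; a URL that was never added is usually, but not always, reported as absent, which is the false positive error mentioned above.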
Several existing URL deduplication schemes based on the Bloom Filter have shortcomings in efficiency, scalability and performance. For example, the Internet Archive crawler stores all URLs of each website in a 32KB Bloom Filter. The Apoide crawler stores all URLs of each website in an 8KB Bloom Filter and uses the hash value of the website's domain name to determine which Bloom Filter a given URL is stored in. In general, the Bloom Filter organization in these schemes is relatively fixed. For some large portal websites (such as Sina and Xinhuanet), the number of URLs under the domain name is so large that they are difficult to store in a single Bloom Filter of fixed, limited size; for some small websites, on the other hand, an oversized Bloom Filter inevitably wastes storage space. In addition, information collection systems with demanding performance requirements often deploy a large number of web crawlers at the same time in a distributed network environment to crawl pages in parallel. Existing Bloom Filter schemes and their improved variants are mostly ill-suited to URL deduplication in this scenario.
Consistent hashing is a special kind of hashing. With consistent hashing, a change in the number of slots (the size) of the hash table requires, on average, only K/n keys to be remapped, where K is the number of keys and n is the number of slots. In a traditional hash table, by contrast, adding or removing a single slot requires almost all keys to be remapped. Consistent hashing maps each object to a point on the edge of a ring, and the system likewise maps the available node machines to different positions on the ring. To find the machine responsible for an object, the position of the object on the ring is computed with the consistent hashing algorithm, and the ring is then traversed from that position until a node machine is encountered; that machine is where the object should be stored. When a node machine is removed, all objects stored on it are moved to the next machine. When a machine is added at some point on the ring, the next machine after that point must move the objects that fall before the new node onto the new machine. The distribution of objects across node machines can be changed by adjusting the positions of the node machines.
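The ring lookup and the limited remapping described above can be sketched in Python as follows. This is an illustrative sketch only: the names `ring_hash` and `HashRing`, the choice of SHA-256, and the ring size of 2**32 are assumptions, and the case of two nodes hashing to the same point is ignored.

```python
import bisect
import hashlib


def ring_hash(key: str, m: int = 32) -> int:
    """Map a key to a point in the ring space [0, 2**m)."""
    digest = hashlib.sha256(key.encode("utf-8")).digest()
    return int.from_bytes(digest, "big") % (2 ** m)


class HashRing:
    def __init__(self) -> None:
        self._points: list[int] = []      # sorted node positions on the ring
        self._nodes: dict[int, str] = {}  # position -> node name

    def add_node(self, name: str) -> None:
        point = ring_hash(name)
        bisect.insort(self._points, point)
        self._nodes[point] = name

    def remove_node(self, name: str) -> None:
        point = ring_hash(name)
        self._points.remove(point)
        del self._nodes[point]

    def lookup(self, key: str) -> str:
        # Walk clockwise to the first node at or after the key's position,
        # wrapping around to the first node on the ring if necessary.
        idx = bisect.bisect_left(self._points, ring_hash(key))
        return self._nodes[self._points[idx % len(self._points)]]
```

With this layout, adding a node only takes over keys from the single node that follows it on the ring; every other key keeps its previous owner.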
Summary of the invention
In view of the problems and deficiencies of the prior art, the object of the present invention is to provide a URL deduplication method for a distributed crawler system. As the number of URLs keeps growing, the method keeps the false positive rate within a given threshold while still making full use of the efficiency of the Bloom Filter, and it is therefore better suited to building large-scale distributed web crawler systems.
To achieve the above object, the present invention provides the following technical solution:
A URL deduplication method for a distributed crawler system, comprising the following steps:
S1, treat the server cluster as a unified resource pool and place hash values into a hash ring space of size 2^m; each service node is also placed on the hash ring as an object, each service node corresponds to one Bloom Filter structure, and each service node handles the requests that fall in its corresponding range;
S2, each node initializes its Bloom Filter structure, that is, initializes an array of n bits in which every bit is initially 0;
S3, hash a newly obtained URL to compute the value H;
S4, obtain the corresponding service node K from the position at which H falls on the hash ring;
S5, the corresponding server K applies k hash functions to the URL (k is the number of hash functions and is distinct from the node label K), obtaining k hash values H[0], H[1], ..., H[k-1];
S6, look up the bitmap of the Bloom Filter at the k hash values and determine whether the corresponding bits are all 1; if they are all 1, the URL is considered a duplicate and the method proceeds to step S7, otherwise it proceeds to step S8;
S7, discard the duplicate URL and return to step S3;
S8, put the URL into the crawler's pending queue;
S9, set the positions H[0], H[1], ..., H[k-1] in the Bloom Filter of server node K to 1;
S10, record the insertion in the log, with content H, H[0], H[1], ..., H[k-1], and increase the insertion record count by 1;
S11, determine whether the utilization of the Bloom Filter has reached the threshold; if it has not, return to step S3, otherwise proceed to step S12;
S12, add a node: generate a new node K+1 and migrate the corresponding Bloom Filter content to it, that is, replay the insertion log of node K and, for every log entry whose H value belongs to node K+1, set the positions H[0], H[1], ..., H[k-1] in the Bloom Filter of node K+1 to 1;
S13, node K performs an operation similar to step S12 to generate a new Bloom Filter, replaces the original Bloom Filter once the new one is complete, and returns to step S3.
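The per-node part of this procedure (steps S2 and S5 to S11) can be sketched in Python as follows. This is a hedged illustration rather than the claimed implementation: the names `h64`, `DedupNode`, `handle` and `needs_split`, the salted SHA-256 hash functions, and the interpretation of the threshold as 0.9 times a capacity c are assumptions. The ring hash H from steps S3 and S4 is passed in as `ring_h` and is assumed to be computed by the caller.

```python
import hashlib


def h64(data: str, salt: int = 0) -> int:
    """One member of a family of hash functions, selected by salt (assumed)."""
    digest = hashlib.sha256(f"{salt}:{data}".encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big")


class DedupNode:
    def __init__(self, n_bits: int, k: int, capacity: int) -> None:
        self.n_bits, self.k = n_bits, k
        self.bits = bytearray(n_bits)       # S2: one byte per bit, all 0
        self.log: list[tuple[int, list[int]]] = []  # S10: (H, [H[0..k-1]])
        self.threshold = 0.9 * capacity     # S11: assumed threshold of 0.9 * c

    def positions(self, url: str) -> list[int]:
        # S5: k hash values H[0], ..., H[k-1], reduced to bit positions
        return [h64(url, salt=i + 1) % self.n_bits for i in range(self.k)]

    def handle(self, url: str, ring_h: int, queue: list[str]) -> bool:
        """Return True if the URL is new and was queued, False if duplicate."""
        pos = self.positions(url)
        if all(self.bits[p] for p in pos):  # S6: all bits already 1
            return False                    # S7: discard the duplicate
        queue.append(url)                   # S8: hand the URL to the crawler
        for p in pos:
            self.bits[p] = 1                # S9: mark the URL as seen
        self.log.append((ring_h, pos))      # S10: log entry; count = len(log)
        return True

    def needs_split(self) -> bool:
        return len(self.log) >= self.threshold  # S11: time to add a node
```

For example, calling `handle` twice with the same URL queues it once and reports the second call as a duplicate; `needs_split` then tells the caller when steps S12 and S13 should run.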
Specifically, in the Bloom Filter structure, the false positive probability f of the Bloom Filter is approximately
$$f = \left(1 - e^{-kc/m}\right)^{k}$$
where k is the number of hash functions, m is the total number of bits, and c is the number of inserted elements. Given f, k and m, the maximum number of elements c that may be inserted is:
$$c = -\frac{m}{k}\,\ln\left(1 - f^{1/k}\right)$$
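As a worked illustration, the capacity bound can be computed numerically. The sketch below assumes the standard Bloom Filter approximation f = (1 - e^(-kc/m))^k, which solves to c = -(m/k) ln(1 - f^(1/k)); the function names are hypothetical and chosen only for this example.

```python
import math


def false_positive_rate(k: int, m: int, c: int) -> float:
    """Approximate false positive rate after c insertions (assumed formula)."""
    return (1.0 - math.exp(-k * c / m)) ** k


def max_insertions(f: float, k: int, m: int) -> int:
    """Largest c whose approximate false positive rate stays at or below f."""
    return math.floor(-(m / k) * math.log(1.0 - f ** (1.0 / k)))


# Example: a filter of one million bits with 7 hash functions and a 1% target
c = max_insertions(0.01, 7, 1_000_000)
print(c, false_positive_rate(7, 1_000_000, c))
```

Inserting one more element than the returned c pushes the approximate rate above the target f, which is why the method compares the insertion count against a fraction of c before adding a node.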
Compared with the prior art, the beneficial effects of the present invention are as follows:
The present invention combines consistent hashing with the Bloom Filter. Unlike schemes that fix the Bloom Filter in advance, it can add Bloom Filter nodes dynamically as needed. As the number of URLs keeps growing, this keeps the false positive rate of the Bloom Filter within a given range while still making full use of the efficiency of the Bloom Filter. The method is suitable for building large-scale distributed web crawlers and supports efficient crawling of massive amounts of web content.
Brief description of the drawings
Fig. 1 is a flow chart of the URL deduplication method provided by the invention.
Fig. 2 is a schematic diagram of the Bloom Filter structure obtained after initialization.
Fig. 3 is a schematic diagram of the Bloom Filter structure for the case in which the value H computed by hashing a URL falls in arc AB of the hash ring and is handled by service node B.
Fig. 4 is a schematic flow chart of service node B handling a URL.
Fig. 5 is a schematic diagram of the Bloom Filter structure when a service node is added.
Detailed description of the embodiments
The technical solutions in the embodiments of the present invention are described clearly and completely below with reference to the accompanying drawings. The described embodiments are obviously only some of the embodiments of the present invention, not all of them. All other embodiments obtained by those of ordinary skill in the art on the basis of the embodiments of the present invention, without creative work, fall within the scope of protection of the present invention.
Referring to Fig. 1, the present invention provides a technical solution:
The technical solution of the present invention is described in further detail below with reference to the drawings and specific embodiments.
As shown in Fig. 1, the main steps of the URL deduplication method for a distributed crawler system are initializing the consistent hash ring, deduplicating URLs, and adding service nodes. Service nodes are placed on the hash ring, and each service node is responsible for one range of hash values. For a given URL, the value H is first computed by hashing and located on the hash ring; the corresponding service node then determines whether the URL is a duplicate. A duplicate is discarded; otherwise the URL is inserted into the Bloom Filter. When the insertion record count on a service node reaches the threshold, a new service node is added to share the load.
1. Set up N service nodes according to the initial configuration. Each service node corresponds to one Bloom Filter structure, and the Bloom Filter is initialized, that is, a bit array of size n is allocated in memory and all of its bits are set to 0. As shown in Fig. 1, N=3 here. The large circle in the figure is the hash ring, whose range is 0 to 2^m; A, B and C are the positions to which the service nodes hash on the ring. Node A is responsible for arc CA, node B for arc AB, and node C for arc BC. (Only three service nodes are shown for convenience of illustration; in practice the initial number of service nodes can be set as needed.)
2. Hash the given URL, locate the result on the hash ring, and find the corresponding service node. As shown in Fig. 3, the value H computed by hashing the URL falls in arc AB of the hash ring, so the URL is handled by service node B.
3. Service node B handles the URL. It applies k hash functions to the URL to compute H[0], H[1], ..., H[k-1] and checks whether the corresponding bits in the Bloom Filter structure are all 1. If they are all 1, it reports a duplicate URL and the URL is discarded. Otherwise it reports a non-duplicate URL and the URL is added to the crawler task queue; in addition, the insertion is recorded in the log and the insertion count is increased by 1.
4. Add a service node. When the insertion record count exceeds the threshold (the threshold is set to 0.9*c), a service node must be added. The position of the new node is computed first, the corresponding Bloom Filter bitmap information is then imported, and finally the service node is added to the hash ring. As shown in Fig. 5, because the insertion records of service node B have reached the threshold, a new node D is added to share the load; after node D is inserted, requests in arc AD are handled by node D and requests in arc DB are handled by node B. Before it starts serving, service node D must load the existing bitmap, that is, re-import the write log of service node B: for each record whose H value belongs to arc AD, the corresponding positions H[0], H[1], ..., H[k-1] are set to 1. Node B re-imports the write records in the background: for each record whose H value belongs to arc DB, the corresponding positions H[0], H[1], ..., H[k-1] are set to 1, and the previous Bloom Filter is replaced once the import is complete. This completes the operation of adding a service node.
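The log replay used when node D is inserted between A and B can be sketched in Python as follows. This is an illustrative sketch under stated assumptions: the function names are hypothetical, each log entry is assumed to be a pair (H, [H[0], ..., H[k-1]]) with the hash values already reduced to bit positions, arcs are treated as half-open intervals (start, end] in the direction of increasing hash value, and the new filters are assumed to have the same size and hash functions as the old one.

```python
def in_arc(h: int, start: int, end: int) -> bool:
    """True if ring position h lies in the arc (start, end]."""
    if start < end:
        return start < h <= end
    return h > start or h <= end  # the arc wraps past zero


def replay(
    log: list[tuple[int, list[int]]], n_bits: int, start: int, end: int
) -> tuple[bytearray, list[tuple[int, list[int]]]]:
    """Rebuild a bit array from the log entries whose H lies in (start, end]."""
    bits = bytearray(n_bits)  # one byte per bit, all 0
    kept = []
    for ring_h, positions in log:
        if in_arc(ring_h, start, end):
            for p in positions:
                bits[p] = 1
            kept.append((ring_h, positions))
    return bits, kept


def split_node(
    log: list[tuple[int, list[int]]],
    n_bits: int,
    pos_a: int,
    pos_d: int,
    pos_b: int,
):
    """Split B's arc (A, B] into D's arc (A, D] and B's new arc (D, B]."""
    d_state = replay(log, n_bits, pos_a, pos_d)  # loaded by D before serving
    b_state = replay(log, n_bits, pos_d, pos_b)  # built by B in the background
    return d_state, b_state
```

Because each rebuilt filter contains only the entries of its own arc, both filters end up less full than the original one, which is what brings the false positive rate back under control after the split.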