CN107798106A - URL deduplication method in a distributed crawler system - Google Patents
URL deduplication method in a distributed crawler system
- Publication number
- CN107798106A (application number CN201711047215.9A)
- Authority
- CN
- China
- Prior art keywords
- bloom filter
- url
- hash
- node
- service node
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Granted
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/90—Details of database functions independent of the retrieved data types
- G06F16/95—Retrieval from the web
- G06F16/951—Indexing; Web crawling techniques
-
- Y—GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
- Y02—TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
- Y02D—CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
- Y02D10/00—Energy efficient computing, e.g. low power processors, power management or thermal management
Abstract
The invention discloses a URL deduplication method for a distributed crawler system. Hash values are placed in a ring space of size 2^m; each node handles one contiguous segment of the hash ring, and each node has its own Bloom Filter structure. To deduplicate a URL, its hash is computed first, the corresponding service node is located, and that node's Bloom Filter is then consulted to decide whether the URL has already been seen. By combining consistent hashing with Bloom Filters, the method can add Bloom Filter nodes dynamically as needed. As the number of URLs keeps growing, the false positive rate of each Bloom Filter stays within a given range while the efficiency of the Bloom Filter is fully exploited. The method is suited to building large-scale distributed web crawlers that crawl massive amounts of web content efficiently.
Description
Technical field
The invention relates to the field of network technology, and specifically to a URL deduplication method in a distributed crawler system.
Background
A web crawler is a program that automatically fetches web content according to certain rules, and it is widely used across the Internet. Starting from specified URLs, a crawler downloads page content, extracts the URLs it contains, and then continues downloading from those addresses. Because a newly extracted URL may already have been processed, handling it again causes repeated downloads and wastes bandwidth and computing resources. URL deduplication is therefore widely used in network audit systems and search engines. A distributed web crawler fetches pages in parallel, so URL tasks must be assigned to multiple servers according to some strategy, and the partitioning strategy must be efficient and easy to implement. In a distributed environment, a URL extracted by one host may already have been processed by another host in the system, so the system needs a distributed URL deduplication mechanism. URL deduplication has two main concerns: URL storage space and URL matching speed. Storage space refers to the maximum number of distinct URLs that can be handled and the memory occupied by each URL. Matching speed is measured by the time needed to decide whether a URL is a duplicate.
A Bloom Filter is an effective tool for URL deduplication. The basic idea is to map a URL through several different hash functions to different positions in a single bit array, and to determine from the state of those bits whether the URL has already been collected. The advantage of the Bloom Filter algorithm is that only the bit array needs to be kept in memory to determine the collection state of a URL; the URLs themselves need not be stored, so the memory footprint is small and lookups are fast. However, when testing whether an element belongs to a set, a Bloom Filter may mistakenly report that an element outside the set belongs to it. Its drawback is therefore that it is not exact: a certain false positive rate exists.
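The Bloom Filter idea described above can be illustrated with a minimal Python sketch. The class name `BloomFilter`, the parameter names, and the way the hash functions are derived (salting SHA-256 with the function index) are assumptions made for illustration; the source does not specify which hash functions are used.

```python
import hashlib


class BloomFilter:
    """Minimal Bloom filter: one bit array plus several hash functions."""

    def __init__(self, num_bits: int, num_hashes: int):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray((num_bits + 7) // 8)  # all bits start at 0

    def _positions(self, item: str):
        # Derive the bit positions by salting one hash with the function
        # index (an illustrative choice, not taken from the source).
        for i in range(self.num_hashes):
            digest = hashlib.sha256(f"{i}:{item}".encode("utf-8")).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, item: str) -> None:
        for pos in self._positions(item):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def __contains__(self, item: str) -> bool:
        # True means "probably seen"; False means "definitely not seen".
        return all(self.bits[pos // 8] & (1 << (pos % 8))
                   for pos in self._positions(item))


seen = BloomFilter(num_bits=8192, num_hashes=4)
seen.add("http://example.com/a")
print("http://example.com/a" in seen)  # True
```

Only the bit array is stored, never the URL itself, which is why the structure is compact but can occasionally answer "seen" for a URL that was never added.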
Some existing Bloom Filter based URL deduplication schemes fall short in efficiency, scalability and performance. For example, the Internet Archive crawler stores all URLs of each website in a 32 KB Bloom Filter. The Apoidea crawler stores all URLs of each website in an 8 KB Bloom Filter and uses the hash of the site's domain name to decide which Bloom Filter a given URL is stored in. In general, the organization of Bloom Filters in these schemes is relatively fixed. For large portal sites (such as Sina or Xinhuanet) the number of URLs under the domain is so large that a single fixed-size Bloom Filter of limited length can hardly hold them, whereas for small sites an oversized Bloom Filter wastes storage. In addition, information collection systems with high performance requirements often deploy many crawlers at once in a distributed network environment to crawl pages in parallel. Existing Bloom Filters and their variants are mostly ill-suited to URL deduplication in that setting.
Consistent hashing is a special kind of hashing. With consistent hashing, a change in the number of slots (the size) of the hash table requires remapping only K/n keys on average, where K is the number of keys and n is the number of slots. In a traditional hash table, by contrast, adding or removing a slot requires remapping nearly all keys. Consistent hashing maps each object to a point on the edge of a ring, and the system maps the available node machines to other positions on the same ring. To find the machine responsible for an object, the consistent hash of the object is computed to obtain its position on the ring, and the ring is then traversed from that position until a node machine is encountered; that machine is where the object should be stored. When a node machine is removed, all objects stored on it are moved to the next machine. When a machine is added at some point on the ring, the next machine after that point must move the objects that now map before the new node onto the new machine. The distribution of objects across node machines can be changed by adjusting the positions of the node machines.
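A minimal Python sketch of the ring lookup described above follows. The names `ring_hash` and `HashRing`, the use of SHA-256, and the ring size of 2^32 are illustrative assumptions; the source only requires a ring of size 2^m. Hash collisions between node names are ignored for brevity.

```python
import bisect
import hashlib


def ring_hash(key: str, m: int = 32) -> int:
    """Map a key to a point on a ring of size 2**m."""
    digest = hashlib.sha256(key.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % (1 << m)


class HashRing:
    def __init__(self, nodes=()):
        self._points = []   # sorted ring positions of the nodes
        self._owners = {}   # ring position -> node name
        for node in nodes:
            self.add_node(node)

    def add_node(self, node: str) -> None:
        point = ring_hash(node)
        bisect.insort(self._points, point)
        self._owners[point] = node

    def lookup(self, key: str) -> str:
        # Walk clockwise from the key's position to the first node,
        # wrapping around to the start of the ring if necessary.
        idx = bisect.bisect_left(self._points, ring_hash(key))
        return self._owners[self._points[idx % len(self._points)]]


ring = HashRing(["node-A", "node-B", "node-C"])
keys = [f"http://example.com/page/{i}" for i in range(1000)]
before = {k: ring.lookup(k) for k in keys}
ring.add_node("node-D")
after = {k: ring.lookup(k) for k in keys}
moved = [k for k in keys if before[k] != after[k]]
# Every key that changed owner now belongs to the new node.
print(all(after[k] == "node-D" for k in moved))  # True
```

The usage lines show the property the paragraph relies on: adding a node only takes keys away from its successor on the ring, and all other assignments stay put.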
Summary of the invention
In view of the problems and shortcomings of the prior art, the object of the invention is to provide a URL deduplication method in a distributed crawler system. As the number of URLs keeps increasing, the method keeps the false positive rate within a given threshold while making full use of the efficiency of the Bloom Filter, which makes it better suited to building large-scale distributed web crawler systems.
To achieve the above object, the invention provides the following technical solution:
A URL deduplication method in a distributed crawler system, comprising the following steps:
S1. The server cluster is treated as a unified resource pool, and hash values are placed in a hash ring space of size 2^m. Each service node is also placed on the hash ring as an object. Each service node has its own Bloom Filter structure and handles the requests that fall in its range;
S2. Each node initializes its Bloom Filter structure, that is, an array of n bits with every bit initially 0;
S3. The hash H of a newly obtained URL is computed;
S4. The service node k responsible for the URL is found from the position of H on the hash ring;
S5. Server k applies K hash functions to the URL, obtaining K hash values H[0], H[1], ..., H[K-1];
S6. The K hash values are looked up in the Bloom Filter bitmap to check whether the corresponding bits are all 1. If they are all 1, the URL is considered a duplicate and the method goes to step S7; otherwise it goes to step S8;
S7. The duplicate URL is discarded, and the method returns to step S3;
S8. The URL is put into the crawler's pending queue;
S9. The bits at positions H[0], H[1], ..., H[K-1] in the Bloom Filter of service node k are all set to 1;
S10. An insertion log entry is recorded, containing H, H[0], H[1], ..., H[K-1], and the insertion record count is incremented by 1;
S11. Whether the utilization of the Bloom Filter has reached the threshold is checked. If it has not, the method returns to step S3; otherwise it goes to step S12;
S12. A node is added: a new node k+1 is created and the relevant Bloom Filter content of node k is migrated to it. That is, the write log of node k is replayed, and for each entry whose H value belongs to node k+1, the bits at H[0], H[1], ..., H[K-1] in the Bloom Filter of node k+1 are all set to 1;
S13. Node k performs an operation similar to step S12 to generate a new Bloom Filter, replaces its original Bloom Filter once this is complete, and the method returns to step S3.
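Steps S1 to S10 can be sketched in Python as follows. This is a single-process illustration under stated assumptions, not the patented implementation: the names `h`, `ServiceNode` and `Deduplicator`, the SHA-256 based hashing, the ring size of 2^32 and the default filter size are all hypothetical, and a real deployment would place each `ServiceNode` on a separate server.

```python
import bisect
import hashlib


def h(data: str, salt: str = "", mod: int = 1 << 32) -> int:
    digest = hashlib.sha256(f"{salt}:{data}".encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % mod


class ServiceNode:
    """One ring node with its own Bloom filter and insertion log (S1, S2)."""

    def __init__(self, name: str, n_bits: int = 1 << 16, k_hashes: int = 4):
        self.name = name
        self.n_bits = n_bits
        self.k_hashes = k_hashes
        self.bits = bytearray(n_bits // 8)   # all bits start at 0
        self.log = []                        # entries: (H, [H[0], ..., H[K-1]])
        self.inserted = 0

    def check_and_insert(self, ring_pos: int, url: str) -> bool:
        """Return True if the URL is new (and record it), False if duplicate."""
        pos = [h(url, str(i), self.n_bits) for i in range(self.k_hashes)]  # S5
        if all((self.bits[p // 8] >> (p % 8)) & 1 for p in pos):           # S6
            return False                                                   # S7
        for p in pos:                                                      # S9
            self.bits[p // 8] |= 1 << (p % 8)
        self.log.append((ring_pos, pos))                                   # S10
        self.inserted += 1
        return True


class Deduplicator:
    def __init__(self, node_names):
        pairs = sorted((h(name, "node"), name) for name in node_names)
        self.points = [point for point, _ in pairs]
        self.nodes = [ServiceNode(name) for _, name in pairs]
        self.queue = []                      # crawler's pending queue

    def submit(self, url: str) -> bool:
        ring_pos = h(url)                                                  # S3
        idx = bisect.bisect_left(self.points, ring_pos) % len(self.points) # S4
        is_new = self.nodes[idx].check_and_insert(ring_pos, url)
        if is_new:
            self.queue.append(url)                                         # S8
        return is_new


dedup = Deduplicator(["A", "B", "C"])
print(dedup.submit("http://example.com/"))  # True: first time seen
print(dedup.submit("http://example.com/"))  # False: duplicate
```

The insertion log kept by each node is what later makes node splitting possible: it stores the ring position H together with the bit positions, so a filter for any sub-segment can be rebuilt without the original URLs.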
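Specifically, for the Bloom Filter structure, the false positive probability f of a Bloom Filter is
$$f = \left(1 - e^{-kc/m}\right)^{k}$$
where k is the number of hash functions, m is the total number of bits, and c is the number of inserted elements (k and m here are the Bloom Filter parameters, written K and n in the steps above). Given f, k and m, the maximum number of elements c that may be inserted is:
$$c = -\frac{m}{k}\,\ln\!\left(1 - f^{1/k}\right)$$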
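As a worked example, the capacity bound can be evaluated numerically. The sketch below assumes the standard Bloom filter approximation f = (1 - e^(-kc/m))^k and its inverse c = -(m/k) ln(1 - f^(1/k)); the function names and the sample parameters (an 8 MiB filter, 7 hash functions, a 1% target) are hypothetical.

```python
import math


def false_positive_rate(k: int, m: int, c: int) -> float:
    """Approximate false positive probability after c insertions."""
    return (1.0 - math.exp(-k * c / m)) ** k


def max_insertions(f: float, k: int, m: int) -> int:
    """Largest c for which the approximate false positive rate stays <= f."""
    return math.floor(-(m / k) * math.log(1.0 - f ** (1.0 / k)))


m = 8 * 1024 * 1024 * 8   # bits in an 8 MiB filter
k = 7                     # number of hash functions
c = max_insertions(0.01, k, m)
print(c, false_positive_rate(k, m, c))
```

A node would then compare its insertion record count against a fraction of c to decide when a new node is needed.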
Compared with the prior art, the invention has the following beneficial effects:
The invention combines consistent hashing with Bloom Filters. Unlike schemes in which the Bloom Filters are fixed in advance, it can add Bloom Filter nodes dynamically as needed. As the number of URLs keeps growing, the false positive rate of each Bloom Filter is kept within a given range while the efficiency of the Bloom Filter is fully exploited. The method is suited to building large-scale distributed web crawlers and supports efficient crawling of massive web content.
Brief description of the drawings
Fig. 1 is a flow chart of the method provided by the invention.
Fig. 2 is a schematic diagram of the Bloom Filter structure obtained after initialization.
Fig. 3 is a schematic diagram of the Bloom Filter structure when the hash H of a URL falls in segment AB of the hash ring and is handled by service node B.
Fig. 4 is a schematic flow chart of service node B processing a URL.
Fig. 5 is a schematic diagram of the Bloom Filter structure when a service node is added.
Detailed description of the embodiments
The technical solutions in the embodiments of the invention are described clearly and completely below with reference to the accompanying drawings. The described embodiments are only some, not all, of the embodiments of the invention. All other embodiments obtained by a person of ordinary skill in the art on the basis of these embodiments without creative effort fall within the scope of protection of the invention.
Referring to Fig. 1, the invention provides the following technical solution, which is described in further detail below with reference to the drawings and specific embodiments.
As shown in Fig. 1, the main steps of the URL deduplication method in a distributed crawler system are initializing the consistent hash ring, deduplicating URLs, and adding service nodes. Service nodes are assigned to positions on the hash ring, and each service node is responsible for one segment of hash values. For a given URL, its hash H is computed first and falls somewhere on the hash ring; the corresponding service node then decides whether the URL is a duplicate. A duplicate is discarded; otherwise the URL is inserted into the Bloom Filter. When the number of insertion records on a service node reaches the threshold, a new service node is added to share the load.
1. According to the initially configured number of service nodes N, each service node is given its own Bloom Filter structure. Initializing a Bloom Filter means allocating a bit array of size n in memory with all bits set to 0. As shown in Fig. 1, N = 3 here; the large circle in the figure is the hash ring, whose values range from 0 to 2^m. A, B and C are the positions on the hash ring to which the service nodes hash. Node A handles segment CA, node B handles segment AB, and node C handles segment BC. (Only three service nodes are shown for ease of illustration; in practice the initial number of service nodes can be set as needed.)
2. The hash of the given URL is computed and falls on the hash ring, and the corresponding service node is found. As shown in Fig. 3, the hash H of the URL falls in segment AB of the hash ring, so the URL is handled by service node B.
3. Service node B processes the URL. It applies k hash functions to the URL to compute H[0], H[1], ..., H[k-1], and checks whether the corresponding bits in the Bloom Filter structure are all 1. If they are all 1, the URL is reported as a duplicate and discarded. Otherwise it is reported as not duplicated, the URL is added to the crawler task queue, and the corresponding bits are set to 1; in addition, an insertion log entry must be recorded and the insertion count incremented by 1.
4. Adding a service node. When the number of insertion records exceeds the threshold (set to 0.9*c), a service node must be added. The position of the new node is first computed and determined, the corresponding Bloom Filter bitmap information is then imported, and finally the service node is added to the hash ring. As shown in the figure, because the insertion records of service node B have reached the threshold, a new node D is added to share the load. After node D is inserted, node D handles requests in segment AD and node B handles requests in segment DB. Before it starts serving, service node D must load the existing bitmap: the write log of service node B is replayed, and for each record whose H value belongs to segment AD, the bits at H[0], H[1], ..., H[k-1] are set to 1. Node B re-imports its write log in the background: for each record whose H value belongs to segment DB, the bits at H[0], H[1], ..., H[k-1] are set to 1 in a new Bloom Filter, which replaces the previous Bloom Filter once the import is complete. This completes the addition of a service node.
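The log replay used when a node is added can be sketched in Python as follows. The function names `in_arc`, `needs_new_node` and `replay`, the ring positions (A = 100, B = 200, D = 160) and the tiny 16-bit filters are hypothetical values chosen so the result is easy to check by hand; only the 0.9*c threshold and the replay idea come from the text.

```python
def in_arc(h: int, start: int, end: int) -> bool:
    """True if ring position h lies on the clockwise arc (start, end]."""
    if start < end:
        return start < h <= end
    return h > start or h <= end  # the arc wraps past position 0


def needs_new_node(inserted: int, capacity: int, ratio: float = 0.9) -> bool:
    """Threshold test from the text: split once inserted exceeds ratio * c."""
    return inserted > ratio * capacity


def replay(log, n_bits: int, start: int, end: int) -> bytearray:
    """Rebuild a bit array from the log entries whose H falls in (start, end]."""
    bits = bytearray((n_bits + 7) // 8)
    for ring_pos, positions in log:
        if in_arc(ring_pos, start, end):
            for p in positions:
                bits[p // 8] |= 1 << (p % 8)
    return bits


# Hypothetical ring positions: A = 100, B = 200, new node D = 160.
# Each log entry of node B is (H, [H[0], ..., H[k-1]]).
log_b = [(120, [1, 9]), (150, [3, 9]), (180, [5, 12])]

if needs_new_node(inserted=len(log_b), capacity=3):
    bits_d = replay(log_b, n_bits=16, start=100, end=160)  # segment AD -> node D
    bits_b = replay(log_b, n_bits=16, start=160, end=200)  # segment DB -> node B
    print(list(bits_d), list(bits_b))  # [10, 2] [32, 16]
```

Because both filters are rebuilt from the log rather than copied, node B's new filter contains only the bits of the URLs it still owns, which lowers its fill level after the split.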
Claims (2)
1. A URL deduplication method in a distributed crawler system, characterized in that it comprises the following steps:
S1. The server cluster is treated as a unified resource pool, and hash values are placed in a hash ring space of size 2^m. Each service node is also placed on the hash ring as an object. Each service node has its own Bloom Filter structure and handles the requests that fall in its range;
S2. Each node initializes its Bloom Filter structure, that is, an array of n bits with every bit initially 0;
S3. The hash H of a newly obtained URL is computed;
S4. The service node k responsible for the URL is found from the position of H on the hash ring;
S5. Server k applies K hash functions to the URL, obtaining K hash values H[0], H[1], ..., H[K-1];
S6. The K hash values are looked up in the Bloom Filter bitmap to check whether the corresponding bits are all 1. If they are all 1, the URL is considered a duplicate and the method goes to step S7; otherwise it goes to step S8;
S7. The duplicate URL is discarded, and the method returns to step S3;
S8. The URL is put into the crawler's pending queue;
S9. The bits at positions H[0], H[1], ..., H[K-1] in the Bloom Filter of service node k are all set to 1;
S10. An insertion log entry is recorded, containing H, H[0], H[1], ..., H[K-1], and the insertion record count is incremented by 1;
S11. Whether the utilization of the Bloom Filter has reached the threshold is checked. If it has not, the method returns to step S3; otherwise it goes to step S12;
S12. A node is added: a new node k+1 is created and the relevant Bloom Filter content of node k is migrated to it. That is, the write log of node k is replayed, and for each entry whose H value belongs to node k+1, the bits at H[0], H[1], ..., H[K-1] in the Bloom Filter of node k+1 are all set to 1;
S13. Node k performs an operation similar to step S12 to generate a new Bloom Filter, replaces its original Bloom Filter once this is complete, and the method returns to step S3.
2. The URL deduplication method in a distributed crawler system according to claim 1, characterized in that, for the Bloom Filter structure, the false positive probability f of a Bloom Filter is
$$f = \left(1 - e^{-kc/m}\right)^{k}$$
where k is the number of hash functions, m is the total number of bits, and c is the number of inserted elements; given f, k and m, the maximum number of elements c that may be inserted is:
$$c = -\frac{m}{k}\,\ln\!\left(1 - f^{1/k}\right)$$
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201711047215.9A CN107798106B (en) | 2017-10-31 | 2017-10-31 | URL duplication removing method in distributed crawler system |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201711047215.9A CN107798106B (en) | 2017-10-31 | 2017-10-31 | URL duplication removing method in distributed crawler system |
Publications (2)
Publication Number | Publication Date |
---|---|
CN107798106A true CN107798106A (en) | 2018-03-13 |
CN107798106B CN107798106B (en) | 2023-04-18 |
Family
ID=61547687
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201711047215.9A Active CN107798106B (en) | 2017-10-31 | 2017-10-31 | URL duplication removing method in distributed crawler system |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN107798106B (en) |
Cited By (5)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN108804242A (en) * | 2018-05-23 | 2018-11-13 | 武汉斗鱼网络科技有限公司 | A kind of data counts De-weight method, system, server and storage medium |
CN108959359A (en) * | 2018-05-16 | 2018-12-07 | 顺丰科技有限公司 | A kind of uniform resource locator semanteme De-weight method, device, equipment and medium |
CN109933739A (en) * | 2019-03-01 | 2019-06-25 | 重庆邮电大学移通学院 | A kind of Web page sequencing method and system based on transition probability |
CN110399546A (en) * | 2019-07-23 | 2019-11-01 | 中南民族大学 | Link De-weight method, device, equipment and storage medium based on web crawlers |
CN110673968A (en) * | 2019-09-26 | 2020-01-10 | 科大国创软件股份有限公司 | Token ring-based public opinion monitoring target protection method |
Citations (7)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN102663058A (en) * | 2012-03-30 | 2012-09-12 | 华中科技大学 | URL duplication removing method in distributed network crawler system |
CN104408182A (en) * | 2014-12-15 | 2015-03-11 | 北京国双科技有限公司 | Method and device for processing web crawler data on distributed system |
CN104809182A (en) * | 2015-04-17 | 2015-07-29 | 东南大学 | Method for web crawler URL (uniform resource locator) deduplicating based on DSBF (dynamic splitting Bloom Filter) |
CN105653629A (en) * | 2015-12-28 | 2016-06-08 | 湖南蚁坊软件有限公司 | Hash ring-based distributed data filter method |
CN105956068A (en) * | 2016-04-27 | 2016-09-21 | 湖南蚁坊软件有限公司 | Webpage URL repetition elimination method based on distributed database |
WO2017113324A1 (en) * | 2015-12-31 | 2017-07-06 | 孙燕群 | Regular expression-based url filtering method |
CN107145556A (en) * | 2017-04-28 | 2017-09-08 | 安徽博约信息科技股份有限公司 | General distributed parallel computing environment |
Cited By (8)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN108959359A (en) * | 2018-05-16 | 2018-12-07 | 顺丰科技有限公司 | A kind of uniform resource locator semanteme De-weight method, device, equipment and medium |
CN108959359B (en) * | 2018-05-16 | 2022-10-11 | 顺丰科技有限公司 | Uniform Resource Locator (URL) semantic deduplication method, device, equipment and medium |
CN108804242A (en) * | 2018-05-23 | 2018-11-13 | 武汉斗鱼网络科技有限公司 | A kind of data counts De-weight method, system, server and storage medium |
CN108804242B (en) * | 2018-05-23 | 2022-03-22 | 武汉斗鱼网络科技有限公司 | Data counting and duplicate removal method, system, server and storage medium |
CN109933739A (en) * | 2019-03-01 | 2019-06-25 | 重庆邮电大学移通学院 | A kind of Web page sequencing method and system based on transition probability |
CN110399546A (en) * | 2019-07-23 | 2019-11-01 | 中南民族大学 | Link De-weight method, device, equipment and storage medium based on web crawlers |
CN110399546B (en) * | 2019-07-23 | 2022-02-08 | 中南民族大学 | Link duplicate removal method, device, equipment and storage medium based on web crawler |
CN110673968A (en) * | 2019-09-26 | 2020-01-10 | 科大国创软件股份有限公司 | Token ring-based public opinion monitoring target protection method |
Also Published As
Publication number | Publication date |
---|---|
CN107798106B (en) | 2023-04-18 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN107798106A (en) | A kind of URL De-weight methods in distributed reptile system | |
US8880449B2 (en) | Methods and apparatus for computing graph similarity via signature similarity | |
CN104809182B (en) | Based on the web crawlers URL De-weight method that dynamically can divide Bloom Filter | |
CN102725755B (en) | Method and system of file access | |
CN103347068B (en) | A kind of based on Agent cluster network-caching accelerated method | |
CN104679778A (en) | Search result generating method and device | |
Cambazoglu et al. | Scalability challenges in web search engines | |
CN102006330A (en) | Distributed cache system, data caching method and inquiring method of cache data | |
CN105224606A (en) | A kind of disposal route of user ID and device | |
US20120143844A1 (en) | Multi-level coverage for crawling selection | |
CN105677904B (en) | Small documents storage method and device based on distributed file system | |
CN104111924A (en) | Database system | |
CN102880628A (en) | Hash data storage method and device | |
CN105868234A (en) | Update method and device of caching data | |
Labouseur et al. | Scalable and Robust Management of Dynamic Graph Data. | |
CN104933054B (en) | The URL storage methods and device of cache resource file, cache server | |
CN106709010A (en) | High-efficient HDFS uploading method based on massive small files and system thereof | |
CN107580052A (en) | From the network self-adapting reptile method and system of evolution | |
CN103309873B (en) | The processing method of data, apparatus and system | |
CN102378407B (en) | Object name resolution system and method in internet of things | |
CN104636368A (en) | Data retrieval method and device and server | |
CN104239376B (en) | Date storage method and device | |
CN108259544A (en) | URL querying methods and URL inquiry servers | |
CN103957252B (en) | The journal obtaining method and its system of cloud stocking system | |
CN104794158A (en) | Domain name data repeated detection and fast index method in boundscript window |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||
PE01 | Entry into force of the registration of the contract for pledge of patent right | Denomination of invention: A URL Deduplication Method in Distributed Crawler System; Effective date of registration: 20230625; Granted publication date: 20230418; Pledgee: Dongguan Kechuang Financing Guarantee Co.,Ltd.; Pledgor: GUANGDONG SIYU INFORMATION TECHNOLOGY CO.,LTD.; Registration number: Y2023980045419 |