CN107798106A - URL deduplication method in a distributed crawler system - Google Patents

Info

Publication number
CN107798106A
Authority
CN
China
Prior art keywords
bloom filter
url
hash
node
service node
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN201711047215.9A
Other languages
Chinese (zh)
Other versions
CN107798106B (en)
Inventor
曾映方
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Guangdong Civic Mdt Infotech Ltd
Original Assignee
Guangdong Civic Mdt Infotech Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Guangdong Civic Mdt Infotech Ltd
Priority to CN201711047215.9A
Publication of CN107798106A
Application granted
Publication of CN107798106B
Active
Anticipated expiration

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90 Details of database functions independent of the retrieved data types
    • G06F16/95 Retrieval from the web
    • G06F16/951 Indexing; Web crawling techniques
    • Y GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D10/00 Energy efficient computing, e.g. low power processors, power management or thermal management

Landscapes

  • Engineering & Computer Science (AREA)
  • Databases & Information Systems (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Computer And Data Communications (AREA)
  • Information Transfer Between Computers (AREA)

Abstract

The invention discloses a URL deduplication method in a distributed crawler system. Hash values are placed in a ring space of size 2^m; each node handles one contiguous segment of the hash ring, and each node corresponds to one Bloom Filter structure. To deduplicate a URL, its hash is computed first, the corresponding server node is obtained, and the content of that node's Bloom Filter is then used to judge whether the URL already exists. By combining consistent hashing with the Bloom Filter, the invention allows Bloom Filter nodes to be added dynamically as needed. As the number of URLs keeps growing, the false positive rate of the Bloom Filter is kept within a given range while the efficiency of the Bloom Filter is still fully exploited. The method is suitable for building large-scale distributed web crawlers and supports efficient crawling of massive amounts of web content.

Description

URL deduplication method in a distributed crawler system
Technical field
The present invention relates to the field of network technology, and specifically to a URL deduplication method in a distributed crawler system.
Background
A web crawler is a program that automatically fetches web content according to certain rules, and it is widely used on the Internet. Starting from specified URLs, a crawler downloads page content, extracts the URLs it contains, and then continues downloading from those addresses. Because a newly extracted URL may already have been processed, continuing to process it would cause repeated downloads and waste bandwidth and computing resources. URL deduplication is widely used in network audit systems and search engine systems. In a distributed web crawler system, URL tasks must be assigned to multiple servers under some strategy so that they can be fetched in parallel, and the partitioning strategy must be efficient and easy to implement. In a distributed environment, a URL extracted by one host may already have been processed by another host in the system, so the system needs a distributed URL deduplication mechanism. URL deduplication mainly involves two concerns: URL storage space and URL matching speed. URL storage space refers to the maximum number of non-duplicate URLs that can be handled and the memory occupied by each URL. URL matching speed is measured by the time needed to judge whether a URL is a duplicate.
The Bloom Filter is an effective tool for URL deduplication. The main idea of a Bloom Filter deduplication scheme is as follows: the same URL is mapped by several different hash functions to different positions in the same bit array, and the crawl state of the URL (whether it has already been crawled) is identified from the state of those positions in the bit array. The advantage of the Bloom Filter algorithm is that only the bit array needs to be kept in memory to determine the crawl state of a URL; the URLs themselves need not be stored, so the memory footprint is small and lookups are fast. However, when judging whether an element belongs to a set, the Bloom Filter algorithm may mistakenly report that an element not in the set belongs to it. The drawback of the Bloom Filter algorithm is therefore that it is not exact: a certain false positive error exists.
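The idea described above can be illustrated with a minimal Python sketch. It is not the patent's implementation: the class name `BloomFilter`, the use of SHA-256, and the way the k hash functions are derived by salting a single hash with the function index are all assumptions made for illustration.

```python
import hashlib


class BloomFilter:
    """Minimal Bloom filter over a fixed-size bit array (illustrative only)."""

    def __init__(self, num_bits: int, num_hashes: int) -> None:
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        # One byte holds eight bits; every bit starts at 0.
        self.bits = bytearray((num_bits + 7) // 8)

    def _positions(self, url: str) -> list:
        # Assumption: emulate k different hash functions by salting
        # SHA-256 with the function index. The source does not say
        # which hash functions are used.
        positions = []
        for i in range(self.num_hashes):
            digest = hashlib.sha256(f"{i}:{url}".encode("utf-8")).digest()
            positions.append(int.from_bytes(digest[:8], "big") % self.num_bits)
        return positions

    def add(self, url: str) -> None:
        # Mark the URL as seen by setting all k positions to 1.
        for pos in self._positions(url):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def __contains__(self, url: str) -> bool:
        # True means "probably seen" (false positives are possible);
        # False means "definitely not seen".
        return all(
            self.bits[pos // 8] & (1 << (pos % 8))
            for pos in self._positions(url)
        )
```

A URL that has been added always tests as present, while a URL that was never added may occasionally test as present as well; that one-sided error is the false positive discussed above.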
Some existing URL deduplication schemes based on the Bloom Filter fall short in efficiency, scalability and performance. For example, the Internet Archive crawler uses a 32KB Bloom Filter to store all URLs of each website. The Apoide crawler uses an 8KB Bloom Filter to store all URLs of each website, and decides which Bloom Filter a given URL is stored in according to the hash of the website's domain name. In general, the Bloom Filter organization in these schemes is relatively fixed. For some large portal websites (such as Sina and www.xinhuanet.com), the number of URLs under the domain name is so large that they are difficult to store in a single fixed-size Bloom Filter of limited length; for some small websites, an oversized Bloom Filter wastes storage space. In addition, information collection systems with high performance requirements often deploy a large number of web crawlers at once in a distributed network environment to crawl pages in parallel. Existing Bloom Filters and their improved variants are mostly ill-suited to URL deduplication in this scenario.
Consistent hashing is a special kind of hashing. With consistent hashing, a change in the number of slots (the size) of the hash table requires remapping only K/n keys on average, where K is the number of keys and n is the number of slots. In a traditional hash table, by contrast, adding or removing a slot requires remapping almost all keys. Consistent hashing maps each object to a point on the edge of a circle, and the system then maps the available node machines to different positions on the same circle. To find the machine for an object, the consistent hash algorithm computes the object's position on the circle, and the circle is then searched along its edge until a node machine is met; that machine is where the object should be stored. When a node machine is removed, all objects stored on it are moved to the next machine. When a machine is added at some point on the circle, the next machine after that point must move the objects that map before the new node onto the new machine. The distribution of objects across node machines can be changed by adjusting the positions of the node machines.
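The lookup rule described above (walk along the circle from the object's position until a node is met) can be sketched in Python as follows. The names `ring_hash` and `HashRing`, the SHA-256 hash, and the ring size of 2**32 are assumptions chosen for illustration; virtual nodes and node removal are left out for brevity.

```python
import bisect
import hashlib


def ring_hash(key: str, ring_bits: int = 32) -> int:
    """Map a key to a point on a ring of size 2**ring_bits (assumed size)."""
    digest = hashlib.sha256(key.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % (1 << ring_bits)


class HashRing:
    """Consistent hash ring with one point per node (illustrative only)."""

    def __init__(self) -> None:
        self._points = []  # sorted positions of the nodes on the ring
        self._nodes = {}   # position -> node name

    def add_node(self, name: str) -> None:
        point = ring_hash(name)
        if point not in self._nodes:
            bisect.insort(self._points, point)
        self._nodes[point] = name

    def lookup(self, key: str) -> str:
        # Find the first node at or after the key's position,
        # wrapping around to the first node at the end of the ring.
        idx = bisect.bisect_left(self._points, ring_hash(key))
        if idx == len(self._points):
            idx = 0
        return self._nodes[self._points[idx]]
```

With this structure, adding a node only takes over keys from the node that follows it on the ring; every other key keeps its previous owner, which is the property that makes the average remapping cost K/n rather than K.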
Summary of the invention
It is an object of the invention to address the problems and deficiencies of the prior art by providing a URL deduplication method in a distributed crawler system. The method ensures that, as the number of URLs keeps growing, the false positive rate can be kept within a given threshold while the efficiency of the Bloom Filter is still fully exploited, which makes it better suited to building large-scale distributed web crawler systems.
To achieve the above object, the present invention provides the following technical solution:
A URL deduplication method in a distributed crawler system, comprising the following steps:
S1, the server cluster is treated as a unified resource pool, and hash values are placed in a hash ring space of size 2^m; each service node is also placed on the hash ring as an object, each service node corresponds to one Bloom Filter structure, and each service node handles the requests of its corresponding range;
S2, each node initializes its Bloom Filter structure, i.e. initializes an array of n bits in which every bit is initially 0;
S3, a hash H is computed for a newly obtained URL;
S4, the corresponding service node k is obtained from the position at which H falls on the hash ring;
S5, the corresponding server k computes K hash functions over the URL, obtaining K hash values H[0], H[1], ..., H[k-1];
S6, the bitmap in the Bloom Filter is looked up according to the K hash values to judge whether the corresponding bits are all 1; if they are all 1, the URL is considered a duplicate and the method proceeds to step S7, otherwise it proceeds to step S8;
S7, the duplicate URL is discarded, and the method returns to step S3;
S8, the URL is put into the crawler's pending queue;
S9, the positions corresponding to H[0], H[1], ..., H[k-1] in the Bloom Filter of server node k are all set to 1;
S10, an insertion log entry is recorded with content H, H[0], H[1], ..., H[k-1], and the insertion record count is incremented by 1;
S11, whether the utilization of the Bloom Filter has reached the threshold is judged; if the threshold has not been reached, the method returns to step S3, otherwise it proceeds to step S12;
S12, a node is added: a new node K+1 is generated and the Bloom Filter content corresponding to the node is migrated, i.e. the write log of the node is replayed, and for entries whose H value belongs to node K+1, the positions corresponding to H[0], H[1], ..., H[k-1] in the Bloom Filter of node K+1 are all set to 1;
S13, the corresponding node k performs an operation similar to step S12 to generate a new Bloom Filter, which replaces the original Bloom Filter once complete, and the method returns to step S3.
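Steps S1 to S10 can be sketched end to end in Python as below. This is a simplified illustration under stated assumptions, not the patented implementation: the names `ServiceNode`, `Deduplicator` and `h`, the constants `RING_BITS`, `NUM_BITS` and `NUM_HASHES`, and the use of salted SHA-256 are all hypothetical, and all nodes live in one process instead of on separate servers. The node-addition steps S11 to S13 are not covered here.

```python
import bisect
import hashlib

RING_BITS = 32       # ring size is 2**RING_BITS (assumed value)
NUM_BITS = 1 << 16   # bits per Bloom filter (assumed value)
NUM_HASHES = 4       # hash functions per filter (assumed value)


def h(data: str, salt: str = "") -> int:
    """64-bit integer hash; the salt emulates different hash functions."""
    digest = hashlib.sha256((salt + data).encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big")


class ServiceNode:
    def __init__(self, name: str) -> None:
        self.name = name
        self.bits = bytearray(NUM_BITS // 8)  # S2: all bits start at 0
        self.log = []                         # S10: (H, positions) entries
        self.inserted = 0                     # S10: insertion record count

    def seen_or_insert(self, ring_pos: int, url: str) -> bool:
        """Return True for a duplicate; otherwise record the URL."""
        # S5: compute the hash positions for this URL.
        positions = [h(url, f"{i}:") % NUM_BITS for i in range(NUM_HASHES)]
        # S6: the URL counts as a duplicate only if every bit is 1.
        if all(self.bits[p >> 3] & (1 << (p & 7)) for p in positions):
            return True                       # S7: caller discards the URL
        for p in positions:                   # S9: set every position to 1
            self.bits[p >> 3] |= 1 << (p & 7)
        self.log.append((ring_pos, positions))  # S10: insertion log
        self.inserted += 1
        return False


class Deduplicator:
    def __init__(self, node_names) -> None:
        # S1: place every service node on the hash ring.
        self.by_point = {
            h(name) % (1 << RING_BITS): ServiceNode(name)
            for name in node_names
        }
        self.points = sorted(self.by_point)

    def is_new(self, url: str) -> bool:
        ring_pos = h(url) % (1 << RING_BITS)                      # S3
        idx = bisect.bisect_left(self.points, ring_pos) % len(self.points)
        node = self.by_point[self.points[idx]]                    # S4
        return not node.seen_or_insert(ring_pos, url)
```

A crawler would call `is_new(url)` for each extracted URL and enqueue the URL only when the call returns `True` (step S8).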
Specifically, in the Bloom Filter structure, the false positive probability f of the Bloom Filter is
where k is the number of hash functions, m is the total number of bits and c is the number of inserted elements; then, for given f, k and m, the maximum number of elements c allowed to be inserted is:
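For reference, the widely used approximation for the false positive rate of a Bloom filter, written with the symbols defined above, is sketched below. It is assumed here rather than taken from the patent, whose exact expressions may differ in form. Note that m in this context is the number of bits in one filter, not the exponent of the ring size 2^m used in step S1.

```latex
\[
  f \approx \left(1 - e^{-kc/m}\right)^{k}
\]
Solving for $c$ with $f$, $k$ and $m$ fixed:
\[
  f^{1/k} \approx 1 - e^{-kc/m}
  \quad\Longrightarrow\quad
  c \approx -\frac{m}{k}\,\ln\!\left(1 - f^{1/k}\right)
\]
```

Because f grows with c, keeping the number of inserted elements at or below this value of c keeps the false positive rate at or below the chosen f under this approximation.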
Compared with the prior art, the beneficial effects of the present invention are as follows:
The present invention combines consistent hashing with the Bloom Filter. Unlike a Bloom Filter fixed in advance, Bloom Filter nodes can be added dynamically as needed. As the number of URLs keeps growing, the false positive rate of the Bloom Filter is kept within a given range while the efficiency of the Bloom Filter is still fully exploited. The method is suitable for building large-scale distributed web crawlers and supports efficient crawling of massive amounts of web content.
Brief description of the drawings
Fig. 1 is a flow chart of the method provided by the invention.
Fig. 2 is a schematic diagram of the Bloom Filter structure obtained after initialization;
Fig. 3 is a schematic diagram of the Bloom Filter structure in which the hash H computed for a URL falls in segment AB of the hash ring and is handled by service node B.
Fig. 4 is a schematic flow chart of service node B processing a URL.
Fig. 5 is a schematic diagram of the Bloom Filter structure when a service node is added.
Embodiment
The technical solutions in the embodiments of the present invention are described clearly and completely below with reference to the accompanying drawings. The described embodiments are evidently only some, not all, of the embodiments of the invention. All other embodiments obtained by a person of ordinary skill in the art on the basis of these embodiments without creative effort fall within the scope of protection of the invention.
Referring to Fig. 1, the present invention provides a technical solution:
The technical solution of the present invention is described in further detail below with reference to the accompanying drawings and specific embodiments.
As shown in Fig. 1, the main steps of the URL deduplication method in a distributed crawler system are initializing the consistent hash ring, deduplicating URLs, and adding service nodes. Service nodes are assigned to the hash ring, and each service node is responsible for one segment of hash values. For a given URL, a hash H is computed first and falls on the hash ring; the corresponding service node judges whether the URL is a duplicate, discarding it if so and otherwise inserting it into the Bloom Filter. When the insertion record count on a service node reaches the threshold, a new service node is added to share the load.
1. According to the initially configured number of service nodes N, each service node corresponds to one Bloom Filter structure. The Bloom Filter is initialized, i.e. a bit array of size n is allocated in memory with all bits set to 0. As shown in Fig. 1, N = 3 here. The large circle in the figure is the hash ring, with a value range of 0 to 2^m. A, B and C are the points at which the service nodes' hashes fall on the hash ring: node A is responsible for segment CA, node B for segment AB and node C for segment BC. (Only three service nodes are shown here for ease of illustration; in practice the initial number of service nodes can be set as needed.)
2. A hash is computed for the given URL, which falls on the hash ring, and the corresponding service node is found. As shown in Fig. 3, the hash H computed for the URL falls in segment AB of the hash ring and is handled by service node B.
3. Service node B processes the URL: it applies k hash functions to the URL to obtain H[0], H[1], ..., H[k-1] and judges whether the corresponding bits in the Bloom Filter structure are 1. If they are all 1, the URL is reported as a duplicate and discarded; otherwise it is reported as a non-duplicate and added to the crawler task queue. In addition, an insertion log entry must be recorded and the insertion count incremented by 1.
4. Adding a service node. When the insertion record count exceeds the threshold (the threshold is set to 0.9*c), a service node must be added. The position of the new node is first computed and determined, the corresponding Bloom Filter bitmap information is then imported, and the service node is then added to the hash ring. As shown in the figure, because the insertion records of service node B have reached the threshold, a new node D is needed to share the load. After node D is inserted, node D handles requests in segment AD and node B handles requests in segment DB. Before it starts serving, service node D must load the existing bitmap, i.e. re-import the write log of service node B: for records whose H value belongs to segment AD, the positions corresponding to H[0], H[1], ..., H[k-1] are set to 1. Node B can re-import the write records in the background: for records whose H value belongs to segment DB, the positions corresponding to H[0], H[1], ..., H[k-1] are set to 1, and the previous Bloom Filter is replaced once the import is complete. This completes the operation of adding a service node.
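The node-addition procedure can be sketched in Python as a replay of the insertion log. This is an illustration under assumptions, not the patented implementation: the helper names `max_insertions`, `in_arc` and `rebuild_filter` are hypothetical, the log is assumed to hold `(H, [H[0], ..., H[k-1]])` tuples, and the capacity formula is the standard Bloom filter approximation rather than an expression confirmed by the source.

```python
import math


def max_insertions(f: float, k: int, m: int) -> int:
    """Largest c with (1 - exp(-k*c/m))**k <= f (standard approximation)."""
    return int(-m / k * math.log(1.0 - f ** (1.0 / k)))


def in_arc(ring_pos: int, start: int, end: int) -> bool:
    """True if ring_pos lies on the clockwise arc (start, end] of the ring."""
    if start < end:
        return start < ring_pos <= end
    # The arc wraps past zero (or covers the whole ring when start == end).
    return ring_pos > start or ring_pos <= end


def rebuild_filter(log, num_bits: int, start: int, end: int):
    """Replay logged insertions whose ring position falls in (start, end].

    Returns the new bit array and the number of replayed records.
    """
    bits = bytearray((num_bits + 7) // 8)
    count = 0
    for ring_pos, positions in log:
        if in_arc(ring_pos, start, end):
            for p in positions:
                bits[p >> 3] |= 1 << (p & 7)
            count += 1
    return bits, count
```

In the example above, with node A at ring position a, the new node D at d and node B at b, node D would call `rebuild_filter(log_b, n, a, d)` before it starts serving, and node B would call `rebuild_filter(log_b, n, d, b)` in the background and swap in the result. The trigger for the whole procedure would be an insertion count above `0.9 * max_insertions(f, k, n)`.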

Claims (2)

1. A URL deduplication method in a distributed crawler system, characterized by comprising the following steps:
S1, the server cluster is treated as a unified resource pool, and hash values are placed in a hash ring space of size 2^m; each service node is also placed on the hash ring as an object, each service node corresponds to one Bloom Filter structure, and each service node handles the requests of its corresponding range;
S2, each node initializes its Bloom Filter structure, i.e. initializes an array of n bits in which every bit is initially 0;
S3, a hash H is computed for a newly obtained URL;
S4, the corresponding service node k is obtained from the position at which H falls on the hash ring;
S5, the corresponding server k computes K hash functions over the URL, obtaining K hash values H[0], H[1], ..., H[k-1];
S6, the bitmap in the Bloom Filter is looked up according to the K hash values to judge whether the corresponding bits are all 1; if they are all 1, the URL is considered a duplicate and the method proceeds to step S7, otherwise it proceeds to step S8;
S7, the duplicate URL is discarded, and the method returns to step S3;
S8, the URL is put into the crawler's pending queue;
S9, the positions corresponding to H[0], H[1], ..., H[k-1] in the Bloom Filter of server node k are all set to 1;
S10, an insertion log entry is recorded with content H, H[0], H[1], ..., H[k-1], and the insertion record count is incremented by 1;
S11, whether the utilization of the Bloom Filter has reached the threshold is judged; if the threshold has not been reached, the method returns to step S3, otherwise it proceeds to step S12;
S12, a node is added: a new node K+1 is generated and the Bloom Filter content corresponding to the node is migrated, i.e. the write log of the node is replayed, and for entries whose H value belongs to node K+1, the positions corresponding to H[0], H[1], ..., H[k-1] in the Bloom Filter of node K+1 are all set to 1;
S13, the corresponding node k performs an operation similar to step S12 to generate a new Bloom Filter, which replaces the original Bloom Filter once complete, and the method returns to step S3.
2. The URL deduplication method in a distributed crawler system according to claim 1, characterized in that, in the Bloom Filter structure, the false positive probability f of the Bloom Filter is
where k is the number of hash functions, m is the total number of bits and c is the number of inserted elements; then, for given f, k and m, the maximum number of elements c allowed to be inserted is:
CN201711047215.9A 2017-10-31 2017-10-31 URL duplication removing method in distributed crawler system Active CN107798106B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201711047215.9A CN107798106B (en) 2017-10-31 2017-10-31 URL duplication removing method in distributed crawler system

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201711047215.9A CN107798106B (en) 2017-10-31 2017-10-31 URL duplication removing method in distributed crawler system

Publications (2)

Publication Number Publication Date
CN107798106A (en) 2018-03-13
CN107798106B (en) 2023-04-18

Family

ID=61547687

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201711047215.9A Active CN107798106B (en) 2017-10-31 2017-10-31 URL duplication removing method in distributed crawler system

Country Status (1)

Country Link
CN (1) CN107798106B (en)

Cited By (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108804242A (en) * 2018-05-23 2018-11-13 武汉斗鱼网络科技有限公司 A kind of data counts De-weight method, system, server and storage medium
CN108959359A (en) * 2018-05-16 2018-12-07 顺丰科技有限公司 A kind of uniform resource locator semanteme De-weight method, device, equipment and medium
CN109933739A (en) * 2019-03-01 2019-06-25 重庆邮电大学移通学院 A kind of Web page sequencing method and system based on transition probability
CN110399546A (en) * 2019-07-23 2019-11-01 中南民族大学 Link De-weight method, device, equipment and storage medium based on web crawlers
CN110673968A (en) * 2019-09-26 2020-01-10 科大国创软件股份有限公司 Token ring-based public opinion monitoring target protection method

Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102663058A (en) * 2012-03-30 2012-09-12 华中科技大学 URL duplication removing method in distributed network crawler system
CN104408182A (en) * 2014-12-15 2015-03-11 北京国双科技有限公司 Method and device for processing web crawler data on distributed system
CN104809182A (en) * 2015-04-17 2015-07-29 东南大学 Method for web crawler URL (uniform resource locator) deduplicating based on DSBF (dynamic splitting Bloom Filter)
CN105653629A (en) * 2015-12-28 2016-06-08 湖南蚁坊软件有限公司 Hash ring-based distributed data filter method
CN105956068A (en) * 2016-04-27 2016-09-21 湖南蚁坊软件有限公司 Webpage URL repetition elimination method based on distributed database
WO2017113324A1 (en) * 2015-12-31 2017-07-06 孙燕群 Regular expression-based url filtering method
CN107145556A (en) * 2017-04-28 2017-09-08 安徽博约信息科技股份有限公司 General distributed parallel computing environment

Cited By (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108959359A (en) * 2018-05-16 2018-12-07 顺丰科技有限公司 A kind of uniform resource locator semanteme De-weight method, device, equipment and medium
CN108959359B (en) * 2018-05-16 2022-10-11 顺丰科技有限公司 Uniform Resource Locator (URL) semantic deduplication method, device, equipment and medium
CN108804242A (en) * 2018-05-23 2018-11-13 武汉斗鱼网络科技有限公司 A kind of data counts De-weight method, system, server and storage medium
CN108804242B (en) * 2018-05-23 2022-03-22 武汉斗鱼网络科技有限公司 Data counting and duplicate removal method, system, server and storage medium
CN109933739A (en) * 2019-03-01 2019-06-25 重庆邮电大学移通学院 A kind of Web page sequencing method and system based on transition probability
CN110399546A (en) * 2019-07-23 2019-11-01 中南民族大学 Link De-weight method, device, equipment and storage medium based on web crawlers
CN110399546B (en) * 2019-07-23 2022-02-08 中南民族大学 Link duplicate removal method, device, equipment and storage medium based on web crawler
CN110673968A (en) * 2019-09-26 2020-01-10 科大国创软件股份有限公司 Token ring-based public opinion monitoring target protection method

Also Published As

Publication number Publication date
CN107798106B (en) 2023-04-18

Similar Documents

Publication Publication Date Title
CN107798106A (en) A kind of URL De-weight methods in distributed reptile system
US8019708B2 (en) Methods and apparatus for computing graph similarity via signature similarity
CN103902386B (en) Multi-thread network crawler processing method based on connection proxy optimal management
CN104809182B (en) Based on the web crawlers URL De-weight method that dynamically can divide Bloom Filter
CN102725755B (en) Method and system of file access
CN103347068B (en) A kind of based on Agent cluster network-caching accelerated method
CN104679778A (en) Search result generating method and device
CN102006330A (en) Distributed cache system, data caching method and inquiring method of cache data
US20120143844A1 (en) Multi-level coverage for crawling selection
CN105677904B (en) Small documents storage method and device based on distributed file system
CN102946320B (en) Distributed supervision method and system for user behavior log forecasting network
CN104111924A (en) Database system
CN102880628A (en) Hash data storage method and device
CN104346345A (en) Data storage method and device
CN102253991A (en) Uniform resource locator (URL) storage method, web filtering method, device and system
CN103309873B (en) The processing method of data, apparatus and system
CN107203623B (en) Load balancing and adjusting method of web crawler system
CN107580052A (en) From the network self-adapting reptile method and system of evolution
CN102378407B (en) Object name resolution system and method in internet of things
CN104636368A (en) Data retrieval method and device and server
Alikhan et al. Dingo optimization based network bandwidth selection to reduce processing time during data upload and access from cloud by user
CN104239376B (en) Date storage method and device
CN108259544A (en) URL querying methods and URL inquiry servers
CN102377826B (en) Method for optimal placement of unpopular resource indexes in peer-to-peer network
CN106557503A (en) A kind of method and system of image retrieval

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant
PE01 Entry into force of the registration of the contract for pledge of patent right

Denomination of invention: A URL Deduplication Method in Distributed Crawler System

Effective date of registration: 20230625

Granted publication date: 20230418

Pledgee: Dongguan Kechuang Financing Guarantee Co.,Ltd.

Pledgor: GUANGDONG SIYU INFORMATION TECHNOLOGY CO.,LTD.

Registration number: Y2023980045419

PC01 Cancellation of the registration of the contract for pledge of patent right

Granted publication date: 20230418

Pledgee: Dongguan Kechuang Financing Guarantee Co.,Ltd.

Pledgor: GUANGDONG SIYU INFORMATION TECHNOLOGY CO.,LTD.

Registration number: Y2023980045419