CN105956068A - Webpage URL repetition elimination method based on distributed database - Google Patents

Webpage URL repetition elimination method based on distributed database


Publication number
CN105956068A
Authority
CN
China
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201610277708.0A
Other languages
Chinese (zh)
Inventor
陈丹
黄三伟
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Hunan Yi Fang Softcom Ltd
Original Assignee
Hunan Yi Fang Softcom Ltd
Application filed by Hunan Yi Fang Softcom Ltd


Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90 Details of database functions independent of the retrieved data types
    • G06F16/95 Retrieval from the web
    • G06F16/955 Retrieval from the web using information identifiers, e.g. uniform resource locators [URL]
    • G06F16/9566 URL specific, e.g. using aliases, detecting broken or misspelled links
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90 Details of database functions independent of the retrieved data types
    • G06F16/95 Retrieval from the web
    • G06F16/951 Indexing; Web crawling techniques

Landscapes

  • Engineering & Computer Science (AREA)
  • Databases & Information Systems (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention relates to the technical field of distributed databases, and in particular to a webpage URL repetition elimination (deduplication) method based on a distributed database. The method comprises the following steps: step S101, acquiring the URLs to be crawled, wherein distributed crawlers acquire the URLs of the webpages to be crawled; step S102, calculating hash values of the URLs; step S103, querying the database, wherein each distributed crawler compresses the URLs in its own collection library and sends them together to the distributed database for deduplication; step S104, feeding back a result, wherein the status of the data query result is returned; and step S105, data acquisition, wherein the crawler node determines from the returned result whether the webpage can be crawled. With this method, the invention better solves the memory problem and the single-point problem in deduplicating massive numbers of URLs, while maintaining high query efficiency and a low collision rate.

Description

Webpage URL deduplication method based on a distributed database
Technical field
The present invention relates to the technical field of distributed databases, and in particular to a webpage URL deduplication method based on a distributed database.
Background
Webpage URL deduplication is important for web crawlers. Current deduplication strategies fall into two main classes: memory-based methods and disk-based methods.
Memory-based deduplication has to contend with memory overflow, especially as the volume of webpage URLs keeps growing. The most common solution is a Bloom filter. Although this avoids memory overflow, it sacrifices accuracy: as the amount of data increases, the collision probability also increases.
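The growth in collision probability can be made concrete with the standard approximation for a Bloom filter's false-positive rate, p = (1 - e^(-kn/m))^k, where n is the number of stored items, m the number of bits and k the number of hash functions. The sketch below is illustrative only; the bit-array size and the value of k are assumptions and do not come from the patent.

```python
import math


def bloom_false_positive_rate(n_items: int, m_bits: int, k_hashes: int) -> float:
    """Standard approximation p = (1 - e^(-k*n/m))^k for a Bloom filter."""
    return (1.0 - math.exp(-k_hashes * n_items / m_bits)) ** k_hashes


# Hypothetical sizing (not from the patent): a 1 GiB bit array, 7 hash functions.
M_BITS = 8 * 2**30
K_HASHES = 7

for n in (10**8, 10**9, 5 * 10**9):
    rate = bloom_false_positive_rate(n, M_BITS, K_HASHES)
    print(f"{n:>13,d} URLs -> approximate false-positive rate {rate:.6f}")
```

With the filter size fixed, the rate rises steadily as more URLs are inserted, which is the accuracy trade-off described above.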
Disk-based deduplication does not suffer from memory overflow; such methods typically deduplicate through a database. A traditional relational database, however, faces a single-point problem when deduplicating massive numbers of URLs, and its query efficiency declines as the data volume grows.
Chinese invention patent CN 104809182 A discloses a web crawler URL deduplication method based on a dynamically splittable Bloom filter (DSBF). Unlike the fixed-structure Bloom filters with which the Internet Archive crawler and the Apoide crawler uniformly share URL access tasks, the DSBF has a dynamically scalable structure that can be flexibly split into multiple layers on demand. Although a Bloom filter saves memory, this space efficiency comes at the cost of accuracy.
Summary of the invention
The technical problem to be solved by the present invention is to provide a deduplication method based on a decentralized distributed database.
To solve the above technical problem, the webpage URL deduplication method based on a distributed database of the present invention comprises the following steps:
Step S101: obtain the URLs to be crawled; the distributed crawlers obtain the URLs of the webpages to be crawled;
Step S102: calculate the hash values of the URLs;
Step S103: query the database; each distributed crawler compresses the URLs in its own collection library and sends them together to the distributed database for deduplication;
Step S104: feed back the result; the status of the data query result is returned;
Step S105: data acquisition; according to the returned result status, the crawler node determines whether the webpage can be crawled.
Further, step S104 specifically comprises the following steps:
Step S1041: determine whether the data exists in the database; if it does not exist, go to step S1042; if it exists, go to step S1043;
Step S1042: write the data directly, then return success and go to step S1044;
Step S1043: return failure;
Step S1044: return the database query result to the node where the distributed crawler resides.
Further, calculating the hash value of the URL in step S102 specifically means using the MurmurHash method to map the webpage URL to a hash value of type long.
Further, the distributed database in step S103 adopts a decentralized structure, and a consistent hashing algorithm is used when the distributed database stores data.
Further, the consistent hashing algorithm uses virtual nodes: one actual physical node is divided into multiple non-contiguous virtual nodes, so that when a node goes down, its data can be evenly distributed to the other physical nodes.
With the above method, the webpage URL deduplication method based on a distributed database of the present invention effectively solves the memory problem and the single-point problem in deduplicating massive numbers of URLs, while maintaining high query efficiency and a low collision rate.
Brief description of the drawings
The present invention is described in further detail below with reference to the drawings and specific embodiments.
Fig. 1 is a flow chart of the webpage URL deduplication method based on a distributed database of the present invention.
Detailed description
As shown in Fig. 1, the webpage URL deduplication method based on a distributed database of the present invention comprises the following steps.
Step S101: obtain the URLs to be crawled; the distributed crawlers obtain the URLs of the webpages to be crawled.
Step S102: calculate the hash value of each URL; the MurmurHash method is used to map the webpage URL to a hash value of type long. The advantages of MurmurHash are high computational performance and a low collision rate. In addition, the algorithm also compresses the data, which improves communication efficiency and saves storage space.
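The patent does not say which MurmurHash variant, seed or character encoding is used. As a minimal sketch, assuming the 64-bit MurmurHash2 variant (MurmurHash64A), seed 0, UTF-8 encoding and little-endian block reads, the mapping from a URL to a signed 64-bit value (the range of a Java long) could look like this. The function names `murmurhash64a` and `url_to_long` are hypothetical.

```python
MASK64 = (1 << 64) - 1
M = 0xC6A4A7935BD1E995  # multiplication constant of MurmurHash64A
R = 47                  # shift constant of MurmurHash64A


def murmurhash64a(data: bytes, seed: int = 0) -> int:
    """Pure-Python MurmurHash64A; returns an unsigned 64-bit integer."""
    h = (seed ^ (len(data) * M)) & MASK64
    n_blocks = len(data) // 8

    # Mix each full 8-byte block (read as a little-endian word) into h.
    for i in range(n_blocks):
        k = int.from_bytes(data[8 * i:8 * i + 8], "little")
        k = (k * M) & MASK64
        k ^= k >> R
        k = (k * M) & MASK64
        h ^= k
        h = (h * M) & MASK64

    # Mix in the remaining 1 to 7 bytes, if any.
    tail = data[8 * n_blocks:]
    if tail:
        h ^= int.from_bytes(tail, "little")
        h = (h * M) & MASK64

    # Final avalanche.
    h ^= h >> R
    h = (h * M) & MASK64
    h ^= h >> R
    return h


def url_to_long(url: str, seed: int = 0) -> int:
    """Map a URL to a signed 64-bit value, as a Java long would hold it."""
    h = murmurhash64a(url.encode("utf-8"), seed)
    return h - (1 << 64) if h >= (1 << 63) else h


if __name__ == "__main__":
    for u in ("http://example.com/a", "http://example.com/b"):
        print(u, url_to_long(u))
```

Whatever variant is chosen, every crawler node and the database must agree on it (and on the seed), otherwise the same URL would produce different keys on different nodes.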
Step S103: query the database; each distributed crawler compresses the URLs in its own collection library and sends them together to the distributed database for deduplication. The database system of the present invention adopts a decentralized structure, and the main technique used to implement it is consistent hashing.
Consistent hashing is a commonly used algorithm in distributed systems; its advantages are high stability and support for dynamic scaling. In a distributed storage system, data must be stored on a specific node. If the usual approach of taking the data's hash value modulo the number of nodes is used, scaling becomes a problem: when a machine joins or leaves the cluster, almost all data mappings become invalid. Consistent hashing instead first maps each node onto a virtual ring whose circumference is the value range of the hash algorithm. When data is stored, it is assigned, according to its hash value, to the nearest node in the clockwise direction. Combined with a backup strategy, when a node goes down, the data that node was responsible for is handed over to the nearest node clockwise for storage.
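The difference between modulo placement and ring placement can be sketched as follows. This is an illustration, not the patent's implementation: the node names are hypothetical, and a truncated BLAKE2b digest from the Python standard library stands in for the hash function so the example stays self-contained.

```python
import bisect
import hashlib


def h64(key: str) -> int:
    """Stand-in 64-bit hash (truncated BLAKE2b), used for illustration only."""
    digest = hashlib.blake2b(key.encode("utf-8"), digest_size=8).digest()
    return int.from_bytes(digest, "big")


class HashRing:
    """Basic consistent-hash ring with one point per node."""

    def __init__(self, nodes):
        self._points = sorted((h64(n), n) for n in nodes)

    def add(self, node):
        bisect.insort(self._points, (h64(node), node))

    def owner(self, key_hash: int) -> str:
        # First node at or after the key's position, wrapping around the ring.
        i = bisect.bisect_left(self._points, (key_hash, ""))
        return self._points[i % len(self._points)][1]


keys = [h64(f"http://example.com/page/{i}") for i in range(10_000)]

# Modulo placement: growing from 3 to 4 nodes changes most assignments.
moved_mod = sum(1 for k in keys if k % 3 != k % 4)

# Ring placement: only keys on the arc taken over by the new node move.
ring = HashRing(["db-0", "db-1", "db-2"])
before = [ring.owner(k) for k in keys]
ring.add("db-3")
after = [ring.owner(k) for k in keys]
moved_ring = sum(1 for b, a in zip(before, after) if b != a)

print(f"modulo placement: {moved_mod} of {len(keys)} keys changed node")
print(f"ring placement:   {moved_ring} of {len(keys)} keys changed node")
```

On the ring, every key that changes owner moves to the newly added node; assignments among the existing nodes are untouched.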
One problem that consistent hashing may cause is the "avalanche" problem: when a node goes down, the load on the nearest node rises sharply, which can bring that node down too; if this repeats, the whole cluster fails. The present invention uses virtual nodes to avoid this problem. With virtual nodes, one actual physical node is divided into multiple non-contiguous virtual nodes, so that when a node goes down, its data is evenly distributed to the other physical nodes.
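The effect of virtual nodes on failover can be sketched by counting which surviving nodes inherit a failed node's keys. This is an illustrative sketch under assumptions: the node names, the number of virtual nodes per physical node and the stand-in hash (truncated BLAKE2b) are hypothetical and not taken from the patent.

```python
import bisect
import hashlib
from collections import Counter


def h64(key: str) -> int:
    """Stand-in 64-bit hash (truncated BLAKE2b), used for illustration only."""
    digest = hashlib.blake2b(key.encode("utf-8"), digest_size=8).digest()
    return int.from_bytes(digest, "big")


def build_ring(nodes, vnodes):
    """Sorted (position, physical_node) pairs; each node gets `vnodes` points."""
    return sorted((h64(f"{n}#vn{i}"), n) for n in nodes for i in range(vnodes))


def owner(ring, key_hash):
    # First virtual node at or after the key's position, wrapping around.
    i = bisect.bisect_left(ring, (key_hash, ""))
    return ring[i % len(ring)][1]


def takeover_counts(nodes, failed, vnodes, keys):
    """Count how many of the failed node's keys each surviving node inherits."""
    full = build_ring(nodes, vnodes)
    survivors = build_ring([n for n in nodes if n != failed], vnodes)
    return Counter(owner(survivors, k) for k in keys if owner(full, k) == failed)


NODES = ["db-0", "db-1", "db-2", "db-3"]
KEYS = [h64(f"http://example.com/page/{i}") for i in range(20_000)]

one = takeover_counts(NODES, "db-1", 1, KEYS)     # one point per node
many = takeover_counts(NODES, "db-1", 200, KEYS)  # 200 virtual nodes per node

print("1 point per node:   ", dict(one))
print("200 virtual nodes:  ", dict(many))
```

With a single point per node, the whole load of the failed node lands on its one clockwise successor. With many virtual nodes, the failed node's arcs are scattered around the ring, so its keys are shared among the surviving nodes.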
Step S104: feed back the result; the status of the data query result is returned. This step specifically comprises the following steps:
Step S1041: determine whether the data exists in the database; if it does not exist, go to step S1042; if it exists, go to step S1043;
Step S1042: write the data directly, then return success and go to step S1044;
Step S1043: return failure;
Step S1044: return the database query result to the node where the distributed crawler resides.
Step S105: data acquisition; according to the returned result status, the crawler node determines whether the webpage can be crawled.
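Steps S1041 to S1043 and S105 amount to a check-then-write followed by a decision on the crawler side. The sketch below uses a local SQLite table purely as a stand-in for the distributed database; the table name, the status strings and the function names are hypothetical.

```python
import sqlite3

SUCCESS, FAILURE = "success", "failure"


def open_store(path: str = ":memory:") -> sqlite3.Connection:
    """Open a stand-in store holding the hashes of URLs already seen."""
    conn = sqlite3.connect(path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS seen_urls (url_hash INTEGER PRIMARY KEY)"
    )
    return conn


def check_and_record(conn: sqlite3.Connection, url_hash: int) -> str:
    """S1041-S1043: store a new hash and return SUCCESS, else return FAILURE."""
    row = conn.execute(
        "SELECT 1 FROM seen_urls WHERE url_hash = ?", (url_hash,)
    ).fetchone()
    if row is None:  # S1041: the data does not exist yet
        conn.execute("INSERT INTO seen_urls (url_hash) VALUES (?)", (url_hash,))
        conn.commit()
        return SUCCESS  # S1042: written, report success
    return FAILURE      # S1043: already present, report failure


def should_crawl(status: str) -> bool:
    """S105: the crawler node fetches the page only for a new URL."""
    return status == SUCCESS


if __name__ == "__main__":
    store = open_store()
    for url_hash in (1234567890123, -42, 1234567890123):
        status = check_and_record(store, url_hash)
        print(url_hash, status, "crawl" if should_crawl(status) else "skip")
```

The separate SELECT and INSERT mirror the steps as written. With several crawlers writing concurrently, a real deployment would need the check and the write to be atomic, which this single-connection sketch does not attempt to show.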
The present invention has the following advantages:
1) It uses disk-based deduplication.
Although memory-based deduplication can achieve very high processing performance, it copes poorly with rapid data growth. For a massive data collection system such as a search engine, holding everything in memory under a conventional strategy is impossible. The currently more effective strategy is to use a Bloom filter to save memory, but this space efficiency comes at the cost of accuracy. The present invention therefore uses disk-based storage, which removes the memory problem caused by massive data.
2) It uses a decentralized distributed database.
Linear growth in database size can cause query response time to grow exponentially. Although traditional databases support sharding in theory, it is difficult in practice. Since distributed databases have a natural advantage in supporting this mechanism, the present invention adopts an architecture based on a distributed database, so that the database still maintains high query efficiency when the data volume grows rapidly.
In addition, the present invention adopts a decentralized structure and relies on data redundancy and a consistent hashing strategy to ensure efficient routing of data and high availability of the database cluster.
3) It uses MurmurHash to compress webpage URLs.
Most webpage URLs are long, and storing them directly consumes a great deal of storage space. The present invention therefore uses the MurmurHash algorithm to compress the data, which not only keeps the deduplication collision rate low but also improves communication efficiency between the data acquisition nodes and the database cluster, while effectively saving disk storage space.
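Two quantities behind this advantage can be estimated with a short sketch: the size of a batch once each URL is reduced to an 8-byte value, and the chance that two distinct URLs share a 64-bit hash. The wire format (packed little-endian signed 64-bit integers) and the function names are assumptions for illustration; the collision estimate uses the standard birthday approximation and assumes uniformly distributed hash values.

```python
import math
import struct


def pack_hashes(hashes):
    """Pack signed 64-bit hash values into 8 bytes each (little-endian)."""
    return struct.pack(f"<{len(hashes)}q", *hashes)


def unpack_hashes(payload: bytes):
    """Inverse of pack_hashes."""
    return list(struct.unpack(f"<{len(payload) // 8}q", payload))


def birthday_collision_probability(n_items: int, bits: int = 64) -> float:
    """Approximate chance of at least one collision among n uniform hashes."""
    return -math.expm1(-n_items * (n_items - 1) / 2 / 2**bits)


if __name__ == "__main__":
    urls = [f"http://example.com/catalog/item?id={i}&ref=home" for i in range(1000)]
    raw_bytes = sum(len(u.encode("utf-8")) for u in urls)
    packed_bytes = len(pack_hashes(list(range(len(urls)))))  # placeholder values
    print(f"raw URLs: {raw_bytes} bytes, packed hashes: {packed_bytes} bytes")

    for n in (10**6, 10**8, 10**9):
        p = birthday_collision_probability(n)
        print(f"{n:>13,d} URLs -> collision probability about {p:.2e}")
```

Unlike general-purpose compression, a hash cannot be reversed to recover the URL, so the collision probability is the price paid for the fixed 8-byte size.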
Although specific embodiments of the present invention have been described above, those skilled in the art should understand that these are merely examples, and that various changes or modifications may be made to the embodiments without departing from the principle and essence of the invention. The scope of protection of the present invention is limited only by the appended claims.

Claims (5)

1. A webpage URL deduplication method based on a distributed database, characterized in that it comprises the following steps:
Step S101: obtain the URLs to be crawled; the distributed crawlers obtain the URLs of the webpages to be crawled;
Step S102: calculate the hash values of the URLs;
Step S103: query the database; each distributed crawler compresses the URLs in its own collection library and sends them together to the distributed database for deduplication;
Step S104: feed back the result; the status of the data query result is returned;
Step S105: data acquisition; according to the returned result status, the crawler node determines whether the webpage can be crawled.
2. The webpage URL deduplication method based on a distributed database according to claim 1, characterized in that step S104 specifically comprises the following steps:
Step S1041: determine whether the data exists in the database; if it does not exist, go to step S1042; if it exists, go to step S1043;
Step S1042: write the data directly, then return success and go to step S1044;
Step S1043: return failure;
Step S1044: return the database query result to the node where the distributed crawler resides.
3. The webpage URL deduplication method based on a distributed database according to claim 1, characterized in that calculating the hash value of the URL in step S102 specifically means using the MurmurHash method to map the webpage URL to a hash value of type long.
4. The webpage URL deduplication method based on a distributed database according to claim 1, characterized in that the distributed database in step S103 adopts a decentralized structure, and a consistent hashing algorithm is used when the distributed database stores data.
5. The webpage URL deduplication method based on a distributed database according to claim 4, characterized in that the consistent hashing algorithm uses virtual nodes, wherein one actual physical node is divided into multiple non-contiguous virtual nodes, so that when a node goes down, the data of that node is evenly distributed to the other physical nodes.

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201610277708.0A CN105956068A (en) 2016-04-27 2016-04-27 Webpage URL repetition elimination method based on distributed database

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201610277708.0A CN105956068A (en) 2016-04-27 2016-04-27 Webpage URL repetition elimination method based on distributed database

Publications (1)

Publication Number Publication Date
CN105956068A true CN105956068A (en) 2016-09-21

Family

ID=56916540

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201610277708.0A Pending CN105956068A (en) 2016-04-27 2016-04-27 Webpage URL repetition elimination method based on distributed database

Country Status (1)

Country Link
CN (1) CN105956068A (en)

Cited By (11)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106407485A (en) * 2016-12-20 2017-02-15 福建六壬网安股份有限公司 URL de-repetition method and system based on similarity comparison
CN106484828A (en) * 2016-09-29 2017-03-08 西南科技大学 A kind of distributed interconnection data Fast Acquisition System and acquisition method
CN107329969A (en) * 2017-05-23 2017-11-07 合肥智权信息科技有限公司 It is a kind of that system and method are updated based on the data message repeatedly verified
CN107798106A (en) * 2017-10-31 2018-03-13 广东思域信息科技有限公司 A kind of URL De-weight methods in distributed reptile system
CN108121706A (en) * 2016-11-28 2018-06-05 央视国际网络无锡有限公司 A kind of optimization method of distributed reptile
CN108628871A (en) * 2017-03-16 2018-10-09 哈尔滨英赛克信息技术有限公司 A kind of link De-weight method based on chain feature
CN108874941A (en) * 2018-06-04 2018-11-23 成都知道创宇信息技术有限公司 Big data URL De-weight method based on convolution feature and multiple Hash mapping
CN110275873A (en) * 2019-06-28 2019-09-24 重庆紫光华山智安科技有限公司 File memory method, device, storage management apparatus and storage medium
CN111522847A (en) * 2020-04-16 2020-08-11 山东贝赛信息科技有限公司 Method for removing duplicate of distributed crawler website
CN112015552A (en) * 2020-08-27 2020-12-01 平安科技(深圳)有限公司 Hash ring load balancing method and device, electronic equipment and storage medium
US11695793B2 (en) 2017-10-31 2023-07-04 Micro Focus Llc Vulnerability scanning of attack surfaces

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102663058A (en) * 2012-03-30 2012-09-12 华中科技大学 URL duplication removing method in distributed network crawler system
CN102682085A (en) * 2012-04-18 2012-09-19 北京十分科技有限公司 Method for removing duplicated web page
CN102799647A (en) * 2012-06-30 2012-11-28 华为技术有限公司 Method and device for webpage reduplication deletion
CN103136243A (en) * 2011-11-29 2013-06-05 中国电信股份有限公司 File system duplicate removal method and device based on cloud storage
CN103530369A (en) * 2013-10-14 2014-01-22 浪潮(北京)电子信息产业有限公司 De-weight method and system

Cited By (15)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106484828A (en) * 2016-09-29 2017-03-08 西南科技大学 A kind of distributed interconnection data Fast Acquisition System and acquisition method
CN106484828B (en) * 2016-09-29 2020-01-21 西南科技大学 Distributed internet data rapid acquisition system and acquisition method
CN108121706A (en) * 2016-11-28 2018-06-05 央视国际网络无锡有限公司 A kind of optimization method of distributed reptile
CN106407485B (en) * 2016-12-20 2017-12-26 福建六壬网安股份有限公司 A kind of URL De-weight methods and system based on similarity-rough set
CN106407485A (en) * 2016-12-20 2017-02-15 福建六壬网安股份有限公司 URL de-repetition method and system based on similarity comparison
CN108628871A (en) * 2017-03-16 2018-10-09 哈尔滨英赛克信息技术有限公司 A kind of link De-weight method based on chain feature
CN107329969A (en) * 2017-05-23 2017-11-07 合肥智权信息科技有限公司 It is a kind of that system and method are updated based on the data message repeatedly verified
CN107798106A (en) * 2017-10-31 2018-03-13 广东思域信息科技有限公司 A kind of URL De-weight methods in distributed reptile system
US11695793B2 (en) 2017-10-31 2023-07-04 Micro Focus Llc Vulnerability scanning of attack surfaces
CN108874941A (en) * 2018-06-04 2018-11-23 成都知道创宇信息技术有限公司 Big data URL De-weight method based on convolution feature and multiple Hash mapping
CN108874941B (en) * 2018-06-04 2021-09-21 成都知道创宇信息技术有限公司 Big data URL duplication removing method based on convolution characteristics and multiple Hash mapping
CN110275873A (en) * 2019-06-28 2019-09-24 重庆紫光华山智安科技有限公司 File memory method, device, storage management apparatus and storage medium
CN111522847A (en) * 2020-04-16 2020-08-11 山东贝赛信息科技有限公司 Method for removing duplicate of distributed crawler website
CN112015552A (en) * 2020-08-27 2020-12-01 平安科技(深圳)有限公司 Hash ring load balancing method and device, electronic equipment and storage medium
WO2021151293A1 (en) * 2020-08-27 2021-08-05 平安科技(深圳)有限公司 Hash ring load balancing method and apparatus, and electronic device and storage medium

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication

Application publication date: 20160921
