CN105956068A - Webpage URL repetition elimination method based on distributed database

Info

Publication number: CN105956068A
Application number: CN201610277708.0A
Authority: CN
Inventors: 陈丹; 黄三伟
Original assignee: Hunan Yi Fang Softcom Ltd
Current assignee: Hunan Yi Fang Softcom Ltd
Priority date: 2016-04-27
Filing date: 2016-04-27
Publication date: 2016-09-21

Abstract

The invention relates to the technical field of distributed databases, and in particular to a webpage URL deduplication method based on a distributed database. The method comprises the following steps: step S101, acquiring the URLs to be crawled, wherein distributed crawlers acquire the URLs of the webpages to be crawled; step S102, calculating the hash values of the URLs; step S103, querying the database, wherein each distributed crawler compresses the URLs in its own collection library and sends them together to the distributed database for deduplication; step S104, returning the result, wherein the status of the data query result is returned; and step S105, data acquisition, wherein each crawler node determines from the returned result whether the webpage can be crawled. With this method, the invention better solves the memory problem and the single-point problem that arise when deduplicating massive numbers of URLs, while ensuring high query efficiency and a low collision rate.

Description

Webpage URL deduplication method based on a distributed database

Technical field

The present invention relates to the technical field of distributed databases, and in particular to a webpage URL deduplication method based on a distributed database.

Background art

Webpage URL deduplication is important to web crawlers. Current deduplication strategies fall broadly into two classes: memory-based methods and disk-based methods.

Memory-based deduplication methods face the problem of memory overflow, especially when the number of webpage URLs is massive and still growing. The most common solution at present is to use a Bloom filter. Although this approach solves the memory overflow problem, it sacrifices accuracy: as the data volume increases, the collision probability also increases.
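To illustrate why the collision probability of a Bloom filter grows with data volume, the following minimal sketch evaluates the standard false-positive approximation p ≈ (1 − e^(−kn/m))^k for a filter of fixed size. The function name, the filter size, and the number of hash functions are illustrative assumptions and do not come from the patent.

```python
import math


def bloom_false_positive_rate(n_items: int, m_bits: int, k_hashes: int) -> float:
    """Standard approximation of a Bloom filter's false-positive rate:
    p = (1 - exp(-k * n / m)) ** k."""
    return (1.0 - math.exp(-k_hashes * n_items / m_bits)) ** k_hashes


# Hypothetical filter: 2**30 bits (128 MiB) with 7 hash functions.
M_BITS = 2 ** 30
K_HASHES = 7

# The filter size stays fixed while the number of stored URLs grows.
for n in (10_000_000, 100_000_000, 1_000_000_000):
    rate = bloom_false_positive_rate(n, M_BITS, K_HASHES)
    print(f"{n:>13,} URLs -> false-positive rate {rate:.6f}")
```

With the filter size held constant, the computed rate rises as more URLs are inserted, which is the accuracy trade-off described above.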

Disk-based deduplication methods do not suffer from memory overflow; they typically deduplicate through a database. A traditional relational database, however, faces a single-point problem when deduplicating massive numbers of URLs, and its query efficiency declines as the data volume grows.

Chinese invention patent CN 104809182 A discloses a web crawler URL deduplication method based on a dynamically splittable Bloom filter. The method is built on a dynamically splittable Bloom filter (abbreviated DSBF). Unlike the fixed-structure Bloom filters with which the Internet Archive crawler and the Apoide crawler uniformly handle URL access tasks, the DSBF has a dynamically scalable structure that can flexibly split into multiple layers on demand. Although a Bloom filter saves memory, this space efficiency comes at the cost of accuracy.

Summary of the invention

The technical problem to be solved by the present invention is to provide a deduplication method based on a decentralized distributed database.

To solve the above technical problem, the webpage URL deduplication method based on a distributed database of the present invention comprises the following steps:

Step S101: obtain the URLs to be crawled; the distributed crawlers obtain the URLs of the webpages to be crawled;

Step S102: calculate the hash value of each URL;

Step S103: query the database; each distributed crawler compresses the URLs in its own collection library and sends them together to the distributed database for deduplication;

Step S104: return the result; the status of the data query result is returned;

Step S105: data acquisition; based on the returned result status, the crawler node determines whether the webpage can be crawled.
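The five steps above can be sketched end to end for a single crawler node. This is a minimal illustration under stated assumptions, not the patented implementation: `DedupStore` is a hypothetical in-memory stand-in for the distributed database, and `hash_url` uses BLAKE2b from the Python standard library in place of MurmurHash, because the standard library does not ship MurmurHash.

```python
import hashlib
import struct


def hash_url(url: str) -> int:
    """S102 stand-in: map a URL to a signed 64-bit value.
    BLAKE2b replaces MurmurHash here (assumption, see lead-in)."""
    digest = hashlib.blake2b(url.encode("utf-8"), digest_size=8).digest()
    return struct.unpack("<q", digest)[0]


class DedupStore:
    """Hypothetical in-memory stand-in for the distributed database."""

    def __init__(self):
        self._seen = set()

    def check_and_insert(self, batch):
        """S103/S104: for each hash, return True (success) if it was new
        and has now been written, False (failure) if it already existed."""
        statuses = []
        for url_hash in batch:
            is_new = url_hash not in self._seen
            if is_new:
                self._seen.add(url_hash)
            statuses.append(is_new)
        return statuses


def urls_to_crawl(candidates, store):
    """S101 to S105 for one crawler node: hash, query in one batch, filter."""
    hashes = [hash_url(u) for u in candidates]      # S102
    statuses = store.check_and_insert(hashes)       # S103 + S104
    return [u for u, ok in zip(candidates, statuses) if ok]  # S105


store = DedupStore()
first = urls_to_crawl(["http://example.com/a", "http://example.com/b"], store)
second = urls_to_crawl(["http://example.com/b", "http://example.com/c"], store)
print(first)
print(second)
```

In the second batch only the URL that has not been seen before is returned for crawling.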

Further, step S104 specifically comprises the following steps:

Step S1041: determine whether the data exists in the database; if it does not exist, go to step S1042; if it exists, go to step S1043;

Step S1042: write the data directly, then return success and go to step S1044;

Step S1043: return failure;

Step S1044: return the database query result to the node where the distributed crawler resides.
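The sub-steps of S104 amount to a "write if absent" operation. The sketch below mirrors them against a disk-capable store, using SQLite from the Python standard library purely as a single-node stand-in for the distributed database; the table name `seen`, the function names, and the status strings are assumptions made for illustration.

```python
import sqlite3

SUCCESS, FAILURE = "success", "failure"


def open_store(path: str = ":memory:") -> sqlite3.Connection:
    """Open (or create) the stand-in store; pass a file path for disk storage."""
    conn = sqlite3.connect(path)
    conn.execute("CREATE TABLE IF NOT EXISTS seen (url_hash INTEGER PRIMARY KEY)")
    return conn


def query_and_record(conn: sqlite3.Connection, url_hash: int) -> str:
    """Run S1041 to S1043 for one hash and return the status (S1044)."""
    with conn:  # commits on normal exit, rolls back if an exception escapes
        # S1041: does the hash already exist?
        row = conn.execute(
            "SELECT 1 FROM seen WHERE url_hash = ?", (url_hash,)
        ).fetchone()
        if row is None:
            # S1042: write the hash directly and report success.
            conn.execute("INSERT INTO seen (url_hash) VALUES (?)", (url_hash,))
            return SUCCESS
        # S1043: the hash is a duplicate, report failure.
        return FAILURE


conn = open_store()
print(query_and_record(conn, 1234567890123))  # first sighting of this hash
print(query_and_record(conn, 1234567890123))  # same hash again
```

SQLite's INTEGER type holds signed 64-bit values, so a hash of type long fits in the key column without conversion.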

Further, calculating the hash value of the URL in step S102 specifically means using the MurmurHash method to map the webpage URL to a hash value of type long.
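As an illustration of mapping a URL to a long, the sketch below is a pure-Python port of MurmurHash64A, the 64-bit variant of the public-domain MurmurHash2. The patent does not say which MurmurHash variant or seed is used, so the variant, the default seed of 0, and the helper name `url_to_long` are assumptions.

```python
MASK64 = 0xFFFFFFFFFFFFFFFF


def murmurhash64a(data: bytes, seed: int = 0) -> int:
    """Pure-Python port of MurmurHash64A; returns an unsigned 64-bit value."""
    m = 0xC6A4A7935BD1E995
    r = 47
    length = len(data)
    h = (seed ^ (length * m)) & MASK64

    # Mix the input eight bytes at a time, read as little-endian integers.
    body_end = length - (length % 8)
    for i in range(0, body_end, 8):
        k = int.from_bytes(data[i:i + 8], "little")
        k = (k * m) & MASK64
        k ^= k >> r
        k = (k * m) & MASK64
        h ^= k
        h = (h * m) & MASK64

    # Fold in the remaining 1 to 7 bytes, if any.
    tail = data[body_end:]
    if tail:
        h ^= int.from_bytes(tail, "little")
        h = (h * m) & MASK64

    # Final mixing so that every input bit affects the whole result.
    h ^= h >> r
    h = (h * m) & MASK64
    h ^= h >> r
    return h


def url_to_long(url: str, seed: int = 0) -> int:
    """Map a URL to a signed 64-bit value, the range of a Java-style long."""
    h = murmurhash64a(url.encode("utf-8"), seed)
    return h - (1 << 64) if h >= (1 << 63) else h


print(url_to_long("http://example.com/index.html"))
```

The result depends on the chosen seed, so every crawler node and the database would need to agree on it.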

Further, the distributed database in step S103 uses a decentralized structure, and the distributed database uses a consistent hashing algorithm when storing data.

Further, the consistent hashing algorithm uses virtual nodes: one actual physical node is divided into multiple non-contiguous virtual nodes, so that when a node goes down, its data is distributed evenly across the other physical nodes.

With the above method, the webpage URL deduplication method based on a distributed database of the present invention effectively solves the memory problem and the single-point problem that arise when deduplicating massive numbers of URLs, while ensuring high query efficiency and a low collision rate.

Brief description of the drawings

The present invention is described in further detail below with reference to the drawings and specific embodiments.

Fig. 1 is a flow chart of the webpage URL deduplication method based on a distributed database of the present invention.

Detailed description of the invention

As shown in Fig. 1, the webpage URL deduplication method based on a distributed database of the present invention comprises the following steps. Step S101: obtain the URLs to be crawled; the distributed crawlers obtain the URLs of the webpages to be crawled.

Step S102: calculate the hash value of each URL. The MurmurHash method is used to map each webpage URL to a hash value of type long. MurmurHash offers high computational performance and a low collision rate. In addition, the algorithm also compresses the data, which improves communication efficiency and saves storage space.
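The compression effect comes from replacing each variable-length URL with a fixed 8-byte value. The following minimal sketch packs a batch of such values into one payload, as a crawler might before sending it to the database. The wire format (little-endian signed 64-bit integers), the function names, and the sample hash values are assumptions; the patent does not specify a format.

```python
import struct


def pack_hashes(hashes):
    """Pack signed 64-bit hashes into one contiguous little-endian payload."""
    return struct.pack(f"<{len(hashes)}q", *hashes)


def unpack_hashes(payload: bytes):
    """Inverse of pack_hashes: recover the list of signed 64-bit hashes."""
    count = len(payload) // 8
    return list(struct.unpack(f"<{count}q", payload))


urls = [
    "http://example.com/news/2016/04/27/some-long-article-title.html",
    "http://example.com/products/list?category=books&page=17&sort=price",
]
# Placeholder values standing in for the MurmurHash output of each URL.
hashes = [-8123456789012345678, 4567890123456789012]

payload = pack_hashes(hashes)
raw_size = sum(len(u.encode("utf-8")) for u in urls)
print(f"raw URLs: {raw_size} bytes, packed hashes: {len(payload)} bytes")
```

Each URL costs exactly 8 bytes in the payload regardless of its length, so the saving grows with the average URL length.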

Step S103: query the database. Each distributed crawler compresses the URLs in its own collection library and sends them together to the distributed database for deduplication. The database system in the present invention uses a decentralized structure, implemented chiefly through consistent hashing.

Consistent hashing is an algorithm commonly used in distributed systems; its advantages are high stability and support for dynamic scaling. A distributed storage system stores data on specific nodes. If the common approach of taking the data's hash value modulo the number of nodes is used, scaling becomes a problem: when a machine joins or leaves the cluster, almost all data mappings become invalid. Consistent hashing instead first maps each node onto a virtual ring whose circumference is the value range of the hash algorithm. When data is stored, it is assigned, according to its hash value, to the nearest node in the clockwise direction. Combined with a backup strategy, when a node goes down, the data that node was responsible for is handed to the nearest node clockwise for storage.
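The difference between modulo placement and a hash ring can be seen by adding one node and counting how many keys change owner. The sketch below is a minimal illustration with one ring position per node. The class and function names are hypothetical, and BLAKE2b from the Python standard library is used as a stand-in for whatever hash function places nodes and keys on the ring.

```python
import bisect
import hashlib


def h64(value: str) -> int:
    """Unsigned 64-bit ring position (BLAKE2b used as a stand-in hash)."""
    digest = hashlib.blake2b(value.encode("utf-8"), digest_size=8).digest()
    return int.from_bytes(digest, "big")


class HashRing:
    """Basic consistent-hash ring: one position per node, clockwise lookup."""

    def __init__(self, nodes):
        self._points = sorted((h64(n), n) for n in nodes)
        self._positions = [pos for pos, _ in self._points]

    def node_for(self, key_hash: int) -> str:
        # First node at or after the key's position, wrapping around the ring.
        i = bisect.bisect_left(self._positions, key_hash) % len(self._points)
        return self._points[i][1]


def moved_fraction(before, after, keys) -> float:
    """Fraction of keys whose owner differs between two placement functions."""
    return sum(before(k) != after(k) for k in keys) / len(keys)


keys = [h64(f"http://example.com/page/{i}") for i in range(10_000)]
old_nodes = [f"node-{i}" for i in range(4)]
new_nodes = old_nodes + ["node-4"]

modulo_moved = moved_fraction(
    lambda k: old_nodes[k % len(old_nodes)],
    lambda k: new_nodes[k % len(new_nodes)],
    keys,
)
ring_old, ring_new = HashRing(old_nodes), HashRing(new_nodes)
ring_moved = moved_fraction(ring_old.node_for, ring_new.node_for, keys)
print(f"modulo: {modulo_moved:.1%} of keys moved; ring: {ring_moved:.1%} moved")
```

With modulo placement, going from four to five nodes reassigns about four fifths of the keys. On the ring, only the keys that fall in the new node's arc change owner, and all of them move to the new node.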

One problem that consistent hashing may cause is the "avalanche" problem: when a node goes down, the load on the nearest node rises sharply, which in turn brings that node down; if this repeats, the whole cluster fails. The present invention avoids this problem by using virtual nodes. A virtual node scheme divides one actual physical node into multiple non-contiguous virtual nodes, so that when a node goes down, its data is distributed evenly across the other physical nodes.
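The effect of virtual nodes can be sketched by giving each physical node many scattered ring positions and then removing one node. The class name, the default of 100 virtual nodes per physical node, the `#vn` naming scheme, and the use of BLAKE2b as the ring hash are all assumptions made for illustration.

```python
import bisect
import hashlib
from collections import Counter


def h64(value: str) -> int:
    """Unsigned 64-bit ring position (BLAKE2b used as a stand-in hash)."""
    digest = hashlib.blake2b(value.encode("utf-8"), digest_size=8).digest()
    return int.from_bytes(digest, "big")


class VirtualNodeRing:
    """Consistent-hash ring where each physical node owns many virtual nodes."""

    def __init__(self, nodes, vnodes_per_node: int = 100):
        self._vnodes = vnodes_per_node
        self._points = []  # sorted list of (position, physical node)
        for node in nodes:
            self.add_node(node)

    def add_node(self, node: str) -> None:
        # Scatter the node's virtual nodes around the ring.
        for i in range(self._vnodes):
            bisect.insort(self._points, (h64(f"{node}#vn{i}"), node))

    def remove_node(self, node: str) -> None:
        # Drop every virtual node that belongs to the failed physical node.
        self._points = [p for p in self._points if p[1] != node]

    def node_for(self, key_hash: int) -> str:
        # First virtual node at or after the key, wrapping around the ring.
        i = bisect.bisect_left(self._points, (key_hash, "")) % len(self._points)
        return self._points[i][1]


ring = VirtualNodeRing([f"node-{i}" for i in range(4)])
keys = [h64(f"http://example.com/page/{i}") for i in range(10_000)]
owners_before = {k: ring.node_for(k) for k in keys}

ring.remove_node("node-0")  # simulate node-0 going down
takeover = Counter(
    ring.node_for(k) for k in keys if owners_before[k] == "node-0"
)
print(takeover)
```

The printed counts show how the failed node's keys are shared among the surviving physical nodes rather than landing on a single neighbour. How even the split is depends on the number of virtual nodes per physical node.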

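Step S104: return the result. The status of the data query result is returned, specifically through the following steps:

Step S1041: determine whether the data exists in the database; if it does not exist, go to step S1042; if it exists, go to step S1043;

Step S1042: write the data directly, then return success and go to step S1044;

Step S1043: return failure;

Step S1044: return the database query result to the node where the distributed crawler resides.

Step S105: data acquisition. Based on the returned result status, the crawler node determines whether the webpage can be crawled.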

The present invention has the following advantages:

1) A disk-based deduplication approach is used.

Although memory-based deduplication can achieve very high processing performance, it copes poorly with rapid data growth. For a mass data collection system such as a search engine, keeping everything in memory under a conventional strategy is impossible. A relatively effective strategy at present is to use a Bloom filter to save memory, but this space efficiency comes at the cost of accuracy. The present invention therefore uses disk-based storage, which sidesteps the memory problem caused by massive data.

2) A decentralized distributed database is used.

Linear growth in database size can cause the response time of database queries to grow exponentially. Although traditional databases support a sharding mechanism in theory, it is difficult to operate in practice. Because distributed databases have a natural advantage in supporting this mechanism, the present invention adopts an architecture based on a distributed database, so that the database maintains high query efficiency even as the data volume grows rapidly.

In addition, the present invention uses a decentralized structure, in which data redundancy and a consistent hashing strategy ensure efficient routing of data and high availability of the database cluster.

3) MurmurHash is used to compress webpage URLs.

Most webpage URLs are long, and storing them directly consumes a great deal of storage space. The present invention therefore uses the MurmurHash algorithm to compress the data, which not only ensures a low collision rate during deduplication but also improves the communication efficiency between the data collection nodes and the database cluster, while effectively saving disk storage space.

Although specific embodiments of the present invention have been described above, those skilled in the art should understand that they are merely illustrative, and that various changes or modifications may be made to these embodiments without departing from the principle and essence of the invention. The scope of protection of the present invention is limited only by the appended claims.

Claims

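1. A webpage URL deduplication method based on a distributed database, characterized in that it comprises the following steps:

Step S101: obtain the URLs to be crawled; the distributed crawlers obtain the URLs of the webpages to be crawled;

Step S102: calculate the hash value of each URL;

Step S103: query the database; each distributed crawler compresses the URLs in its own collection library and sends them together to the distributed database for deduplication;

Step S104: return the result; the status of the data query result is returned;

Step S105: data acquisition; based on the returned result status, the crawler node determines whether the webpage can be crawled.

2. The webpage URL deduplication method based on a distributed database according to claim 1, characterized in that step S104 specifically comprises the following steps:

Step S1041: determine whether the data exists in the database; if it does not exist, go to step S1042; if it exists, go to step S1043;

Step S1042: write the data directly, then return success and go to step S1044;

Step S1043: return failure;

Step S1044: return the database query result to the node where the distributed crawler resides.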

3. The webpage URL deduplication method based on a distributed database according to claim 1, characterized in that calculating the hash value of the URL in step S102 specifically means using the MurmurHash method to map the webpage URL to a hash value of type long.

4. The webpage URL deduplication method based on a distributed database according to claim 1, characterized in that the distributed database in step S103 uses a decentralized structure, and the distributed database uses a consistent hashing algorithm when storing data.

5. The webpage URL deduplication method based on a distributed database according to claim 4, characterized in that the consistent hashing algorithm uses virtual nodes, wherein one actual physical node is divided into multiple non-contiguous virtual nodes, so that when a node goes down, the data of that node is distributed evenly across the other physical nodes.