CN105956068A - Webpage URL repetition elimination method based on distributed database - Google Patents
Webpage URL repetition elimination method based on distributed database
- Publication number
- CN105956068A
- Authority
- CN
- China
- Prior art keywords
- url
- webpage
- distributed
- database
- node
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/90—Details of database functions independent of the retrieved data types
- G06F16/95—Retrieval from the web
- G06F16/955—Retrieval from the web using information identifiers, e.g. uniform resource locators [URL]
- G06F16/9566—URL specific, e.g. using aliases, detecting broken or misspelled links
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/90—Details of database functions independent of the retrieved data types
- G06F16/95—Retrieval from the web
- G06F16/951—Indexing; Web crawling techniques
Landscapes
- Engineering & Computer Science (AREA)
- Databases & Information Systems (AREA)
- Theoretical Computer Science (AREA)
- Data Mining & Analysis (AREA)
- Physics & Mathematics (AREA)
- General Engineering & Computer Science (AREA)
- General Physics & Mathematics (AREA)
- Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
Abstract
The invention relates to the technical field of distributed databases, and in particular to a webpage URL deduplication method based on a distributed database. The method comprises the following steps: step S101, acquiring the URLs to be crawled, wherein distributed crawlers obtain the URLs of the webpages to be crawled; step S102, calculating hash values of the URLs; step S103, querying the database, wherein each distributed crawler compresses the URLs in its own collection library and sends them in a batch to the distributed database for deduplication; step S104, feeding back a result, wherein the data query result is returned; and step S105, data acquisition, wherein the crawler nodes determine from the returned result whether each webpage can be crawled. The method better addresses the memory problem and the single-point problem in massive URL deduplication, while maintaining high query efficiency and a low collision rate.
Description
Technical field
The present invention relates to the technical field of distributed databases, and in particular to a webpage URL deduplication method based on a distributed database.
Background
Webpage URL deduplication is of great importance to crawlers. Current deduplication strategies fall broadly into two classes: memory-based deduplication methods and disk-based deduplication methods.
Memory-based deduplication methods face the problem of memory overflow, especially when the number of webpage URLs is massive and steadily growing. The most common solution is to use a Bloom filter. Although this solves the memory overflow problem, it sacrifices accuracy: as the data volume increases, the collision probability also increases.
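The growth in collision (false-positive) probability can be illustrated with the standard approximation for a Bloom filter with m bits, k hash functions, and n inserted items: p ≈ (1 - e^(-kn/m))^k. The following is a minimal sketch only; the filter size and hash count are arbitrary assumptions chosen for illustration and do not come from this document or the cited work.

```python
import math


def bloom_false_positive_rate(n_items: int, m_bits: int, k_hashes: int) -> float:
    """Standard approximation p = (1 - exp(-k*n/m))**k for a Bloom filter."""
    return (1.0 - math.exp(-k_hashes * n_items / m_bits)) ** k_hashes


# Assumed, illustrative parameters: a fixed 2**30-bit filter with 7 hash functions.
M_BITS = 1 << 30
K_HASHES = 7

for n in (10_000_000, 100_000_000, 500_000_000):
    rate = bloom_false_positive_rate(n, M_BITS, K_HASHES)
    print(f"{n:>11,} URLs -> approximate false-positive rate {rate:.6f}")
```

With the filter size held fixed, the rate rises monotonically with the number of stored URLs, which is the trade-off described above.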
Disk-based deduplication methods do not suffer from memory overflow; they typically deduplicate by means of a database. A traditional relational database, however, faces a single-point problem when deduplicating massive numbers of URLs, and its query efficiency declines as the data volume grows.
Chinese invention patent CN 104809182 A discloses a web crawler URL deduplication method based on a dynamically splittable Bloom filter. The basis of the method is a dynamically splittable Bloom filter (abbreviated DSBF). It differs from the fixed-structure Bloom filter with which the Internet Archive crawler and the Apoide crawler uniformly handle URL access tasks: it has a dynamically scalable structure that can flexibly split into multiple layers on demand. Although a Bloom filter saves memory, this space efficiency comes at the cost of accuracy.
Summary of the invention
The technical problem to be solved by the present invention is to provide a deduplication method based on a decentralized distributed database.
To solve the above technical problem, the webpage URL deduplication method based on a distributed database of the present invention comprises the following steps:
Step S101: acquire the URLs to be crawled; the distributed crawlers obtain the URLs of the webpages to be crawled;
Step S102: calculate the hash value of each URL;
Step S103: query the database; each distributed crawler compresses the URLs in its own collection library and sends them in a batch to the distributed database for deduplication;
Step S104: feed back the result; the status of the data query result is returned;
Step S105: data acquisition; according to the returned result status, the crawler node determines whether the webpage can be crawled.
Further, step S104 specifically comprises the following steps:
Step S1041: determine whether the data exists in the database; if it does not exist, go to step S1042; if it exists, go to step S1043;
Step S1042: write the data directly, then return success and go to step S1044;
Step S1043: return failure;
Step S1044: return the database query result status to the node where the distributed crawler resides.
Further, calculating the hash value of the URL in step S102 specifically means using the MurmurHash method to map the webpage URL to a hash value of long type.
Further, the distributed database in step S103 adopts a decentralized structure and uses a consistent hashing algorithm when storing data.
Further, the consistent hashing algorithm uses virtual nodes: one actual physical node is divided into multiple non-contiguous virtual nodes, so that when a node goes down, its data is distributed evenly to the other physical nodes.
With the above method, the webpage URL deduplication method based on a distributed database of the present invention effectively solves the memory problem and the single-point problem in massive URL deduplication, while ensuring high query efficiency and a low collision rate.
Brief description of the drawings
The present invention is described in further detail below with reference to the drawings and specific embodiments.
Fig. 1 is a flow chart of the webpage URL deduplication method based on a distributed database of the present invention.
Detailed description of the invention
As shown in Fig. 1, the webpage URL deduplication method based on a distributed database of the present invention comprises the following steps.
Step S101: acquire the URLs to be crawled; the distributed crawlers obtain the URLs of the webpages to be crawled.
Step S102: calculate the hash value of each URL. The MurmurHash method is used to map the webpage URL to a hash value of long type. The advantages of MurmurHash are high computational performance and a low collision rate. In addition, the algorithm effectively compresses the data, which improves communication efficiency and saves storage space.
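As an illustration of step S102, the sketch below maps a URL to a 64-bit value and packs it into 8 bytes. The text does not say which MurmurHash variant is used; this sketch assumes the 64-bit MurmurHash2 variant (MurmurHash64A), written in pure Python from the public reference algorithm, with a seed of 0 and UTF-8 encoding. The function names are hypothetical.

```python
import struct

_MASK64 = (1 << 64) - 1


def murmurhash64a(data: bytes, seed: int = 0) -> int:
    """64-bit MurmurHash2 (MurmurHash64A), reading 8-byte little-endian blocks."""
    m = 0xC6A4A7935BD1E995
    r = 47
    length = len(data)
    h = (seed ^ (length * m)) & _MASK64

    n_blocks = length // 8
    for i in range(n_blocks):
        k = int.from_bytes(data[8 * i: 8 * i + 8], "little")
        k = (k * m) & _MASK64
        k ^= k >> r
        k = (k * m) & _MASK64
        h ^= k
        h = (h * m) & _MASK64

    # Remaining 1..7 bytes are folded in as one little-endian integer.
    tail = data[8 * n_blocks:]
    if tail:
        h ^= int.from_bytes(tail, "little")
        h = (h * m) & _MASK64

    h ^= h >> r
    h = (h * m) & _MASK64
    h ^= h >> r
    return h


def url_to_hash(url: str) -> int:
    """Map a URL to an unsigned 64-bit hash (assumes UTF-8 encoding)."""
    return murmurhash64a(url.encode("utf-8"))


# A URL of any length is reduced to a fixed 8-byte payload.
url = "http://example.com/some/long/path?with=query&and=parameters"
payload = struct.pack("<Q", url_to_hash(url))
print(len(url.encode("utf-8")), "bytes ->", len(payload), "bytes")
```

Sending the 8-byte value instead of the full URL is what gives the compression effect described above; the cost is that two distinct URLs can, rarely, share a hash.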
Step S103: query the database. Each distributed crawler compresses the URLs in its own collection library and sends them in a batch to the distributed database for deduplication. The database system of the present invention uses a decentralized structure, and the main technique used to implement it is consistent hashing.
Consistent hashing is a commonly used algorithm in distributed systems; its advantages are high stability and support for dynamic scaling. In a distributed storage system, data must be stored on a specific node. If the common approach of taking the data's hash value modulo the number of nodes is used, scaling becomes a problem: when a machine joins or leaves the cluster, most existing data mappings become invalid. Consistent hashing first maps each node onto a virtual ring whose circumference is the value range of the hash algorithm. When data is stored, it is assigned, according to its hash value, to the nearest node in the clockwise direction. Combined with a backup policy, when a node goes down, the data that node was responsible for is handed over to the nearest node clockwise for storage.
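The scaling problem of modulo placement can be made concrete with a small count. The sketch below is illustrative only: consecutive integers stand in for hash values, and the cluster sizes 4 and 5 are arbitrary assumptions.

```python
def count_moved_keys(hashes, old_nodes: int, new_nodes: int) -> int:
    """Count keys whose node index changes under hash-modulo-node-count placement."""
    return sum(1 for h in hashes if h % old_nodes != h % new_nodes)


hashes = range(10_000)  # stand-in hash values, not real URL hashes
moved = count_moved_keys(hashes, old_nodes=4, new_nodes=5)
print(f"{moved} of {len(hashes)} keys change node")  # prints "8000 of 10000 keys change node"
```

Adding a single machine to a four-node cluster relocates 80% of the keys here, whereas on a consistent-hash ring only the keys falling in the arc taken over by the new node would move.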
One problem that consistent hashing may cause is the "avalanche" problem: when a node goes down, the load on its nearest node rises sharply, which may bring that node down as well; if this repeats, the whole cluster fails. The present invention avoids this problem by using virtual nodes. With virtual nodes, one actual physical node is divided into multiple non-contiguous virtual nodes, so that when a node goes down, its data is distributed evenly to the other physical nodes.
Step S104: feed back the result; the status of the data query result is returned. This specifically comprises the following steps:
Step S1041: determine whether the data exists in the database; if it does not exist, go to step S1042; if it exists, go to step S1043;
Step S1042: write the data directly, then return success and go to step S1044;
Step S1043: return failure;
Step S1044: return the database query result status to the node where the distributed crawler resides.
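The check-then-write logic of steps S1041 to S1043 can be sketched on a single storage node as below. SQLite from the Python standard library is used purely as a stand-in for one node of the distributed database, which the text does not name; the table name `seen_urls` and the function names are hypothetical. The sketch folds the existence check and the write into one `INSERT OR IGNORE` statement so that the two cannot be interleaved by another writer.

```python
import sqlite3


def open_store(path: str = ":memory:") -> sqlite3.Connection:
    """Open (or create) the stand-in store; pass a file path for on-disk storage."""
    conn = sqlite3.connect(path)
    conn.execute("CREATE TABLE IF NOT EXISTS seen_urls (url_hash INTEGER PRIMARY KEY)")
    return conn


def _to_signed64(value: int) -> int:
    """SQLite integers are signed 64-bit, so fold unsigned hashes into that range."""
    return value - (1 << 64) if value >= (1 << 63) else value


def check_and_record(conn: sqlite3.Connection, url_hash: int) -> bool:
    """Return True (success) if the hash was new and has been written,
    False (failure) if it was already present."""
    cur = conn.execute(
        "INSERT OR IGNORE INTO seen_urls (url_hash) VALUES (?)",
        (_to_signed64(url_hash),),
    )
    inserted = cur.rowcount == 1  # 0 rows changed means the key already existed
    conn.commit()
    return inserted


conn = open_store()
print(check_and_record(conn, 1234567890))  # first time seen
print(check_and_record(conn, 1234567890))  # already recorded
```

The boolean returned here corresponds to the success or failure status that step S1044 sends back to the crawler node.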
Step S105: data acquisition; according to the returned result status, the crawler node determines whether the webpage can be crawled.
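Taken together, steps S102 to S105 on the crawler side can be sketched as follows. Everything here is a hypothetical illustration: `InMemoryDedupService` stands in for the distributed database, `url_hash64` uses `hashlib.blake2b` as a stand-in for MurmurHash, and the batch interface is an assumption about how the compressed URLs might be sent in one request.

```python
import hashlib
from typing import Iterable, List


def url_hash64(url: str) -> int:
    """Stand-in 64-bit URL hash (the method described uses MurmurHash instead)."""
    digest = hashlib.blake2b(url.encode("utf-8"), digest_size=8).digest()
    return int.from_bytes(digest, "little")


class InMemoryDedupService:
    """Hypothetical stand-in for the distributed database's deduplication query."""

    def __init__(self) -> None:
        self._seen = set()

    def check_and_record_batch(self, hashes: Iterable[int]) -> List[bool]:
        """For each hash, return True if it was new (and is now recorded)."""
        results = []
        for h in hashes:
            is_new = h not in self._seen
            if is_new:
                self._seen.add(h)
            results.append(is_new)
        return results


def select_urls_to_crawl(urls: Iterable[str], service: InMemoryDedupService) -> List[str]:
    """Hash the batch (S102), query the service (S103/S104), keep new URLs (S105)."""
    url_list = list(urls)
    hashes = [url_hash64(u) for u in url_list]
    statuses = service.check_and_record_batch(hashes)
    return [u for u, is_new in zip(url_list, statuses) if is_new]


service = InMemoryDedupService()
print(select_urls_to_crawl(["http://example.com/a", "http://example.com/b"], service))
print(select_urls_to_crawl(["http://example.com/b", "http://example.com/c"], service))
```

In the second call only the URL not seen before is returned for crawling.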
The present invention has the following advantages:
1) A disk-based deduplication approach is used.
Although memory-based deduplication can achieve very high processing performance, it copes poorly with rapid data growth. For a mass data collection system such as a search engine, keeping everything in memory under a conventional strategy is impossible. The currently more effective strategy is to use a Bloom filter to save memory, but this space efficiency comes at the cost of accuracy. The present invention therefore uses disk-based storage, which makes it possible to disregard entirely the memory problem caused by massive data.
2) A distributed database with a decentralized structure is used.
Linear growth in database size can cause the response time of database queries to grow exponentially. Although traditional databases support a sharding mechanism in theory, it is difficult to operate in practice. Since distributed databases have a natural advantage in supporting this mechanism, the present invention adopts an architecture based on a distributed database, so that the database still maintains high query efficiency as the data volume grows rapidly.
In addition, the present invention adopts a decentralized structure, and ensures efficient data routing and high availability of the database cluster through data redundancy and a consistent hashing strategy.
3) MurmurHash is used to compress webpage URLs.
Most webpage URLs are long, and storing them directly consumes a great deal of storage space. The present invention therefore uses the MurmurHash algorithm to compress the data, which not only ensures a low collision rate in deduplication but also improves the communication efficiency between the data acquisition nodes and the database cluster, while effectively saving disk storage space.
Although specific embodiments of the present invention have been described above, those skilled in the art should understand that these are merely examples, and that various changes or modifications may be made to the embodiments without departing from the principle and essence of the invention; the scope of protection of the present invention is limited only by the appended claims.
Claims (5)
1. A webpage URL deduplication method based on a distributed database, characterized in that it comprises the following steps:
Step S101: acquire the URLs to be crawled; the distributed crawlers obtain the URLs of the webpages to be crawled;
Step S102: calculate the hash value of each URL;
Step S103: query the database; each distributed crawler compresses the URLs in its own collection library and sends them in a batch to the distributed database for deduplication;
Step S104: feed back the result; the status of the data query result is returned;
Step S105: data acquisition; according to the returned result status, the crawler node determines whether the webpage can be crawled.
2. The webpage URL deduplication method based on a distributed database according to claim 1, characterized in that step S104 specifically comprises the following steps:
Step S1041: determine whether the data exists in the database; if it does not exist, go to step S1042; if it exists, go to step S1043;
Step S1042: write the data directly, then return success and go to step S1044;
Step S1043: return failure;
Step S1044: return the database query result status to the node where the distributed crawler resides.
3. The webpage URL deduplication method based on a distributed database according to claim 1, characterized in that calculating the hash value of the URL in step S102 specifically means using the MurmurHash method to map the webpage URL to a hash value of long type.
4. The webpage URL deduplication method based on a distributed database according to claim 1, characterized in that the distributed database in step S103 adopts a decentralized structure and uses a consistent hashing algorithm when storing data.
5. The webpage URL deduplication method based on a distributed database according to claim 4, characterized in that the consistent hashing algorithm uses virtual nodes: one actual physical node is divided into multiple non-contiguous virtual nodes, so that when a node goes down, its data is distributed evenly to the other physical nodes.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201610277708.0A CN105956068A (en) | 2016-04-27 | 2016-04-27 | Webpage URL repetition elimination method based on distributed database |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201610277708.0A CN105956068A (en) | 2016-04-27 | 2016-04-27 | Webpage URL repetition elimination method based on distributed database |
Publications (1)
Publication Number | Publication Date |
---|---|
CN105956068A true CN105956068A (en) | 2016-09-21 |
Family
ID=56916540
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201610277708.0A Pending CN105956068A (en) | 2016-04-27 | 2016-04-27 | Webpage URL repetition elimination method based on distributed database |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN105956068A (en) |
Cited By (11)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN106407485A (en) * | 2016-12-20 | 2017-02-15 | 福建六壬网安股份有限公司 | URL de-repetition method and system based on similarity comparison |
CN106484828A (en) * | 2016-09-29 | 2017-03-08 | 西南科技大学 | Distributed internet data rapid acquisition system and acquisition method |
CN107329969A (en) * | 2017-05-23 | 2017-11-07 | 合肥智权信息科技有限公司 | Data information updating system and method based on repeated verification |
CN107798106A (en) * | 2017-10-31 | 2018-03-13 | 广东思域信息科技有限公司 | URL deduplication method in a distributed crawler system |
CN108121706A (en) * | 2016-11-28 | 2018-06-05 | 央视国际网络无锡有限公司 | Optimization method for a distributed crawler |
CN108628871A (en) * | 2017-03-16 | 2018-10-09 | 哈尔滨英赛克信息技术有限公司 | Link deduplication method based on chain features |
CN108874941A (en) * | 2018-06-04 | 2018-11-23 | 成都知道创宇信息技术有限公司 | Big data URL deduplication method based on convolution features and multiple hash mapping |
CN110275873A (en) * | 2019-06-28 | 2019-09-24 | 重庆紫光华山智安科技有限公司 | File memory method, device, storage management apparatus and storage medium |
CN111522847A (en) * | 2020-04-16 | 2020-08-11 | 山东贝赛信息科技有限公司 | Method for removing duplicate of distributed crawler website |
CN112015552A (en) * | 2020-08-27 | 2020-12-01 | 平安科技(深圳)有限公司 | Hash ring load balancing method and device, electronic equipment and storage medium |
US11695793B2 (en) | 2017-10-31 | 2023-07-04 | Micro Focus Llc | Vulnerability scanning of attack surfaces |
Citations (5)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN102663058A (en) * | 2012-03-30 | 2012-09-12 | 华中科技大学 | URL duplication removing method in distributed network crawler system |
CN102682085A (en) * | 2012-04-18 | 2012-09-19 | 北京十分科技有限公司 | Method for removing duplicated web page |
CN102799647A (en) * | 2012-06-30 | 2012-11-28 | 华为技术有限公司 | Method and device for webpage reduplication deletion |
CN103136243A (en) * | 2011-11-29 | 2013-06-05 | 中国电信股份有限公司 | File system duplicate removal method and device based on cloud storage |
CN103530369A (en) * | 2013-10-14 | 2014-01-22 | 浪潮(北京)电子信息产业有限公司 | Deduplication method and system |
-
2016
- 2016-04-27 CN CN201610277708.0A patent/CN105956068A/en active Pending
Patent Citations (5)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN103136243A (en) * | 2011-11-29 | 2013-06-05 | 中国电信股份有限公司 | File system duplicate removal method and device based on cloud storage |
CN102663058A (en) * | 2012-03-30 | 2012-09-12 | 华中科技大学 | URL duplication removing method in distributed network crawler system |
CN102682085A (en) * | 2012-04-18 | 2012-09-19 | 北京十分科技有限公司 | Method for removing duplicated web page |
CN102799647A (en) * | 2012-06-30 | 2012-11-28 | 华为技术有限公司 | Method and device for webpage reduplication deletion |
CN103530369A (en) * | 2013-10-14 | 2014-01-22 | 浪潮(北京)电子信息产业有限公司 | Deduplication method and system |
Cited By (15)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN106484828A (en) * | 2016-09-29 | 2017-03-08 | 西南科技大学 | Distributed internet data rapid acquisition system and acquisition method |
CN106484828B (en) * | 2016-09-29 | 2020-01-21 | 西南科技大学 | Distributed internet data rapid acquisition system and acquisition method |
CN108121706A (en) * | 2016-11-28 | 2018-06-05 | 央视国际网络无锡有限公司 | Optimization method for a distributed crawler |
CN106407485B (en) * | 2016-12-20 | 2017-12-26 | 福建六壬网安股份有限公司 | URL deduplication method and system based on similarity comparison |
CN106407485A (en) * | 2016-12-20 | 2017-02-15 | 福建六壬网安股份有限公司 | URL de-repetition method and system based on similarity comparison |
CN108628871A (en) * | 2017-03-16 | 2018-10-09 | 哈尔滨英赛克信息技术有限公司 | Link deduplication method based on chain features |
CN107329969A (en) * | 2017-05-23 | 2017-11-07 | 合肥智权信息科技有限公司 | Data information updating system and method based on repeated verification |
CN107798106A (en) * | 2017-10-31 | 2018-03-13 | 广东思域信息科技有限公司 | URL deduplication method in a distributed crawler system |
US11695793B2 (en) | 2017-10-31 | 2023-07-04 | Micro Focus Llc | Vulnerability scanning of attack surfaces |
CN108874941A (en) * | 2018-06-04 | 2018-11-23 | 成都知道创宇信息技术有限公司 | Big data URL deduplication method based on convolution features and multiple hash mapping |
CN108874941B (en) * | 2018-06-04 | 2021-09-21 | 成都知道创宇信息技术有限公司 | Big data URL duplication removing method based on convolution characteristics and multiple Hash mapping |
CN110275873A (en) * | 2019-06-28 | 2019-09-24 | 重庆紫光华山智安科技有限公司 | File memory method, device, storage management apparatus and storage medium |
CN111522847A (en) * | 2020-04-16 | 2020-08-11 | 山东贝赛信息科技有限公司 | Method for removing duplicate of distributed crawler website |
CN112015552A (en) * | 2020-08-27 | 2020-12-01 | 平安科技(深圳)有限公司 | Hash ring load balancing method and device, electronic equipment and storage medium |
WO2021151293A1 (en) * | 2020-08-27 | 2021-08-05 | 平安科技(深圳)有限公司 | Hash ring load balancing method and apparatus, and electronic device and storage medium |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN105956068A (en) | Webpage URL repetition elimination method based on distributed database | |
CN102521405B (en) | Massive structured data storage and query methods and systems supporting high-speed loading | |
CN102006330B (en) | Distributed cache system, data caching method and inquiring method of cache data | |
US9195599B2 (en) | Multi-level aggregation techniques for memory hierarchies | |
US11442961B2 (en) | Active transaction list synchronization method and apparatus | |
CN107038162A (en) | Real time data querying method and system based on database journal | |
CN103353873B (en) | Optimization implementation method and system based on the service of time measure data real-time query | |
CN105117171A (en) | Energy SCADA massive data distributed processing system and method thereof | |
CN104778188A (en) | Distributed device log collection method | |
CN102521269A (en) | Index-based computer continuous data protection method | |
CN106095589A (en) | Partition allocation method, device and system | |
CN104572505A (en) | System and method for ensuring eventual consistency of mass data caches | |
CN104407879A (en) | A power grid timing sequence large data parallel loading method | |
CN104731799A (en) | Memory database management device | |
CN105354250A (en) | Data storage method and device for cloud storage | |
CN102880671A (en) | Method for actively deleting repeated data of distributed file system | |
CN105007193A (en) | Multi-layer information processing method, system thereof and cluster management node | |
CN105159616A (en) | Disk space management method and device | |
CN108874930A (en) | File attribute information statistical method, device, system, equipment and storage medium | |
CN103870393A (en) | Cache management method and system | |
CN102404411A (en) | Data synchronization method of cloud storage system | |
CN110232095A (en) | A kind of method of data synchronization, device, storage medium and server | |
CN105975345A (en) | Video frame data dynamic equilibrium memory management method based on distributed memory | |
CN104899161A (en) | Cache method based on continuous data protection of cloud storage environment | |
CN102012946A (en) | High-efficiency safety monitoring video/image data storage method |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
C06 | Publication | ||
PB01 | Publication | ||
C10 | Entry into substantive examination | ||
SE01 | Entry into force of request for substantive examination | ||
RJ01 | Rejection of invention patent application after publication | Application publication date: 20160921 |