CN109657118A - URL deduplication method and system for a distributed web crawler - Google Patents
URL deduplication method and system for a distributed web crawler
- Publication number
- CN109657118A (application number CN201811392810.0A)
- Authority
- CN
- China
- Prior art keywords
- url
- letter
- module
- node
- duplicate removal
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
- H—ELECTRICITY
- H04—ELECTRIC COMMUNICATION TECHNIQUE
- H04L—TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
- H04L9/00—Cryptographic mechanisms or cryptographic arrangements for secret or secure communications; Network security protocols
- H04L9/06—Cryptographic mechanisms or cryptographic arrangements for secret or secure communications; Network security protocols the encryption apparatus using shift registers or memories for block-wise or stream coding, e.g. DES systems or RC4; Hash functions; Pseudorandom sequence generators
- H04L9/0643—Hash functions, e.g. MD5, SHA, HMAC or f9 MAC
Landscapes
- Engineering & Computer Science (AREA)
- Power Engineering (AREA)
- Computer Security & Cryptography (AREA)
- Computer Networks & Wireless Communication (AREA)
- Signal Processing (AREA)
- Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
Abstract
The invention discloses a URL deduplication method and system for a distributed web crawler, and relates to the field of data transmission. The method comprises the following steps. Step S01: obtain the URL of a web page to be crawled. Step S02: apply MD5 compression to the URL. Step S03: split the generated ciphertext into a 16-element array. Step S04: convert the ciphertext array into a corresponding path, in the manner of a disk path lookup. Step S05: determine whether a corresponding URL exists in linkurl. Step S06: decrypt the URLs in the resource library and dynamically insert them into an improved generalized list. Step S07: traverse each letter of the URL to be deduplicated. Step S08: determine in turn whether the node corresponding to each letter exists. The invention performs a preliminary deduplication of URLs by combining the MD5 encryption algorithm with a tree, and then performs a secondary deduplication by traversing each letter of the preliminarily deduplicated URLs and checking the node corresponding to each letter. This improves the accuracy and efficiency of web crawler data collection and reduces storage usage.
Description
Technical field
The invention belongs to the field of data transmission, and in particular relates to a URL deduplication method and system for a distributed web crawler.
Background art
At present, the URL deduplication schemes commonly used by web crawlers are database-based deduplication and deduplication based on an in-memory linked list. These schemes work well when the number of stored URLs is small. However, the number of URLs that an existing distributed crawler has to store is usually very large, so URL deduplication must remain efficient over long periods. With the common schemes above, efficiency drops sharply after the crawler has run for a long time, or the task risks stalling altogether. The URL deduplication methods in the prior art are not reasonably designed and need to be improved.
Summary of the invention
The purpose of the invention is to provide a URL deduplication method and system for a distributed web crawler. URLs are first deduplicated in a preliminary pass that combines the MD5 encryption algorithm with a tree. The preliminarily deduplicated URLs are then traversed letter by letter, and a secondary deduplication is performed according to the node corresponding to each letter. This addresses the insufficient accuracy and efficiency of data collection by existing web crawlers and their high resource usage.
To solve the above technical problems, the invention is realized through the following technical solutions.
The invention is a URL deduplication method for a distributed web crawler, comprising the following steps:
Step S01: obtain the URL of a web page to be crawled;
Step S02: apply 16-digit MD5 compression to the acquired URL;
Step S03: split the generated ciphertext into a 16-element array;
Step S04: convert the ciphertext array into a corresponding path, in the manner of a disk path lookup;
Step S05: determine whether a corresponding URL exists in linkurl;
If it does not exist, store the URL in the resource library;
If it exists, delete the URL;
Step S06: decrypt the URLs in the resource library and dynamically insert them into an improved generalized list;
Step S07: starting from the root node of the dynamic generalized list, traverse each letter of the URL to be deduplicated;
Step S08: determine in turn whether the node corresponding to each letter exists;
If it exists, remove this URL as a duplicate;
If it does not exist, store this URL in the queue to be crawled.
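Steps S02 to S04 can be illustrated with a short Python sketch. It rests on several assumptions that the text does not confirm: that "16-digit MD5" means the middle 16 hexadecimal characters of the 32-character MD5 digest, that the 16-element array holds one character per element, and that the "path" is formed by using each character as one path component under a root named after linkurl. The function names are hypothetical.

```python
import hashlib
from pathlib import PurePosixPath
from typing import List


def md5_16(url: str) -> str:
    """Assumed meaning of "16-digit MD5": characters 8..23 of the hex digest."""
    return hashlib.md5(url.encode("utf-8")).hexdigest()[8:24]


def split_digest(digest: str) -> List[str]:
    """S03: split the digest into an array a[0]..a[15], one character each."""
    return list(digest)


def digest_path(chars: List[str], root: str = "linkurl") -> PurePosixPath:
    """S04 (assumed form): one path component per array element."""
    return PurePosixPath(root, *chars)


a = split_digest(md5_16("http://example.com/"))
print(len(a))          # 16
print(digest_path(a))  # a path of the form linkurl/x/x/.../x with 16 components
```

Note that MD5 is a one-way hash, so the original URL cannot be computed back from this digest; any later "decryption" step would have to look up a URL stored alongside it.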
Preferably, in step S02, the encrypted URLs are stored in the form of a tree.
Preferably, in step S03, after the ciphertext is split into a 16-element array a, the value of a[0] is compared with the values of the nodes that the root node points to; if a match exists, the next node r[1] that the matched node points to is compared with a[1], and so on in turn until the comparison of a[15] is finished.
Preferably, in step S07, each node of the generalized list stores one letter. When each letter of the URL is traversed, if the generalized list node corresponding to a letter does not exist, the root node of the corresponding layer and the node corresponding to that letter are created first, after which the traversal resumes.
The invention is also a URL deduplication system for a distributed web crawler, comprising a processor and a memory. The processor is electrically connected in turn to a search module, an encryption module, a conversion module, a decryption module, an insertion module, a deduplication module and the memory. The search module reads the addresses of the web pages to be processed one at a time and passes them to the encryption module. The encryption module applies MD5 compression encryption to each acquired URL. The conversion module converts the ciphertext array generated by encryption into a corresponding path according to disk path symbols. The decryption module decrypts the preliminarily deduplicated URLs. The insertion module inserts the decrypted URLs into the generalized list. The deduplication module performs the secondary deduplication by traversing each letter of a URL, starting from the root node of the generalized list.
The invention has the following advantages:
The invention performs a preliminary deduplication of URLs by combining the MD5 encryption algorithm with a tree, then traverses each letter of the preliminarily deduplicated URLs and performs a secondary deduplication according to the node corresponding to each letter. This improves the accuracy and efficiency of web crawler data collection and reduces storage usage.
Of course, a product implementing the invention does not necessarily need to achieve all of the advantages described above at the same time.
Brief description of the drawings
To explain the technical solutions of the embodiments of the invention more clearly, the drawings needed to describe the embodiments are briefly introduced below. Evidently, the drawings described below show only some embodiments of the invention, and a person of ordinary skill in the art can obtain other drawings from them without creative effort.
Fig. 1 is a block diagram of the URL deduplication method for a distributed web crawler according to the invention;
Fig. 2 is a schematic structural diagram of the URL deduplication system for a distributed web crawler according to the invention.
Detailed description of the embodiments
The technical solutions in the embodiments of the invention are described clearly and completely below with reference to the drawings. Evidently, the described embodiments are only some, not all, of the embodiments of the invention. All other embodiments obtained by a person of ordinary skill in the art on the basis of these embodiments without creative effort fall within the protection scope of the invention.
Referring to Fig. 1, the invention is a URL deduplication method for a distributed web crawler, comprising the following steps:
Step S01: obtain the URL of a web page to be crawled;
Step S02: apply 16-digit MD5 compression to the acquired URL;
Step S03: split the generated ciphertext into a 16-element array;
Step S04: convert the ciphertext array into a corresponding path, in the manner of a disk path lookup;
Step S05: determine whether a corresponding URL exists in linkurl;
If it does not exist, the URL is new, so store it in the resource library;
If it exists, linkurl already contains the corresponding URL, so delete the URL;
Step S06: decrypt the URLs in the resource library and dynamically insert them into an improved generalized list;
Step S07: starting from the root node of the dynamic generalized list, traverse each letter of the URL to be deduplicated;
Step S08: determine in turn whether the node corresponding to each letter exists;
If the traversal reaches the last letter of the URL and a corresponding node has existed for every letter, remove this URL as a duplicate;
If a node does not exist, store this URL in the queue to be crawled.
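The control flow of steps S01 to S08 can be sketched as follows. This is a simplified illustration under stated assumptions, not the patented structure: a plain Python set stands in for the linkurl tree, a second set stands in for the generalized list, and "decryption" in step S06 is modeled as looking up the original URL kept next to its digest (MD5 itself cannot be reversed). The names `md5_16`, `run_pipeline`, `letter_index` and `crawl_queue` are hypothetical.

```python
import hashlib
from collections import deque
from typing import Deque, Dict, Iterable, Set


def md5_16(url: str) -> str:
    # Assumption: "16-digit MD5" is the middle 16 hex characters of the digest.
    return hashlib.md5(url.encode("utf-8")).hexdigest()[8:24]


def run_pipeline(urls: Iterable[str], linkurl: Set[str],
                 letter_index: Set[str], crawl_queue: Deque[str]) -> None:
    resource_library: Dict[str, str] = {}
    for url in urls:                        # S01: URL of a page to be crawled
        key = md5_16(url)                   # S02-S04: digest used as lookup key
        if key in linkurl:                  # S05: already present, drop the URL
            continue
        linkurl.add(key)
        resource_library[key] = url         # S05: new URL, keep it
    for url in resource_library.values():   # S06: recover the stored URL
        if url in letter_index:             # S07-S08: second check
            continue
        letter_index.add(url)
        crawl_queue.append(url)             # not seen before, queue for crawling


linkurl: Set[str] = set()
letter_index: Set[str] = set()
crawl_queue: Deque[str] = deque()
run_pipeline(["http://example.com/a", "http://example.com/b",
              "http://example.com/a"], linkurl, letter_index, crawl_queue)
print(list(crawl_queue))
```

Sets are used here only to keep the sketch short; they do not reproduce the memory or lookup behaviour of the tree and generalized list described in the text.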
In step S02, the encrypted URLs are stored in the form of a tree.
In step S03, after the ciphertext is split into a 16-element array a, the value of a[0] is compared with the values of the nodes that the root node points to. If a match exists, the next node r[1] that the matched node points to is compared with a[1], and so on in turn until the comparison of a[15] is finished. If no match is found, a new node is created whose value is the character currently being compared, and the node that follows it holds the next character.
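The comparison of a[0] through a[15] against the tree can be sketched as a character tree in which each level corresponds to one array position. The sketch assumes the array elements are single characters of the digest and uses a dictionary of children per node, which is an implementation choice not stated in the text. `HashNode` and `check_and_insert` are hypothetical names.

```python
from typing import Dict


class HashNode:
    """One tree node; children are keyed by the next digest character."""

    def __init__(self) -> None:
        self.children: Dict[str, "HashNode"] = {}


def check_and_insert(root: HashNode, a: str) -> bool:
    """Walk a[0]..a[15] from the root.

    Returns True if every character already had a matching node, meaning the
    digest was seen before. Missing nodes are created along the way, so the
    digest is recorded for later lookups.
    """
    node = root
    existed = True
    for ch in a:
        child = node.children.get(ch)
        if child is None:
            # No match at this level: create a node for the current character.
            child = HashNode()
            node.children[ch] = child
            existed = False
        node = child
    return existed


root = HashNode()
print(check_and_insert(root, "0123456789abcdef"))  # False: first occurrence
print(check_and_insert(root, "0123456789abcdef"))  # True: already stored
```

Because every digest has the same length, a stored digest can never be a proper prefix of another, so no end-of-key marker is needed in this first stage.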
In step S07, each node of the generalized list stores one letter. When each letter of the URL is traversed, if the generalized list node corresponding to a letter does not exist, the root node of the corresponding layer and the node corresponding to that letter are created first, after which the traversal resumes. If the node corresponding to the letter exists, the head pointer of that node is followed into the next layer, where the traversal continues.
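A minimal sketch of this letter-by-letter traversal follows, assuming a generalized list node that holds a letter, a head pointer to the first node of the next layer, and a tail pointer to the next node in the same layer. The `end` flag is an addition not found in the text: without it, a URL that is merely a prefix of a stored URL would be reported as a duplicate. `GLNode` and `is_duplicate` are hypothetical names.

```python
from typing import Optional


class GLNode:
    """Generalized list node: one letter plus head (next layer) and tail (same layer)."""

    def __init__(self, letter: str) -> None:
        self.letter = letter
        self.head: Optional["GLNode"] = None  # first node of the next layer
        self.tail: Optional["GLNode"] = None  # next sibling in this layer
        self.end = False                      # assumption: marks a complete URL


def is_duplicate(root: GLNode, url: str) -> bool:
    """Return True if url was inserted before; otherwise insert it and return False."""
    parent = root
    created = False
    for letter in url:
        node, prev = parent.head, None
        # Scan the current layer for the node holding this letter.
        while node is not None and node.letter != letter:
            prev, node = node, node.tail
        if node is None:
            # Letter missing in this layer: create its node, then continue.
            node = GLNode(letter)
            if prev is None:
                parent.head = node
            else:
                prev.tail = node
            created = True
        parent = node  # follow the head pointer side into the next layer
    duplicate = (not created) and parent.end
    parent.end = True
    return duplicate


root = GLNode("")
print(is_duplicate(root, "http://a.com/x"))  # False: new URL, goes to the crawl queue
print(is_duplicate(root, "http://a.com/x"))  # True: duplicate
print(is_duplicate(root, "http://a.com/"))   # False: only a prefix was stored
```

URLs that share a prefix share the nodes for that prefix, which is where the reduction in storage claimed for the scheme would come from.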
Referring to Fig. 2, the invention is a URL deduplication system for a distributed web crawler, comprising a processor and a memory. The processor is electrically connected in turn to a search module, an encryption module, a conversion module, a decryption module, an insertion module, a deduplication module and the memory. The search module reads the addresses of the web pages to be processed one at a time and passes them to the encryption module. The encryption module applies MD5 compression encryption to each acquired URL. The conversion module converts the ciphertext array generated by encryption into a corresponding path according to disk path symbols. The decryption module decrypts the preliminarily deduplicated URLs. The insertion module inserts the decrypted URLs into the generalized list. The deduplication module performs the secondary deduplication by traversing each letter of a URL, starting from the root node of the generalized list.
It is worth noting that the units in the system embodiment above are divided only according to functional logic. The division is not limited to the one described, as long as the corresponding functions can be realized. In addition, the specific names of the functional units serve only to distinguish them from one another and do not limit the protection scope of the invention.
In addition, a person of ordinary skill in the art will understand that all or part of the steps of the method embodiments above can be carried out by a program instructing the relevant hardware, and that the program can be stored in a computer-readable storage medium.
The preferred embodiments of the invention disclosed above are intended only to help explain the invention. They do not describe every detail, nor do they limit the invention to the specific embodiments described. Obviously, many modifications and variations can be made in light of this specification. The embodiments were chosen and described in detail to better explain the principles and practical application of the invention, so that those skilled in the art can understand and use it well. The invention is limited only by the claims and their full scope and equivalents.
Claims (5)
1. A URL deduplication method for a distributed web crawler, characterized by comprising the following steps:
Step S01: obtain the URL of a web page to be crawled;
Step S02: apply 16-digit MD5 compression to the acquired URL;
Step S03: split the generated ciphertext into a 16-element array;
Step S04: convert the ciphertext array into a corresponding path, in the manner of a disk path lookup;
Step S05: determine whether a corresponding URL exists in linkurl;
If it does not exist, store the URL in the resource library;
If it exists, delete the URL;
Step S06: decrypt the URLs in the resource library and dynamically insert them into an improved generalized list;
Step S07: starting from the root node of the dynamic generalized list, traverse each letter of the URL to be deduplicated;
Step S08: determine in turn whether the node corresponding to each letter exists;
If it exists, remove this URL as a duplicate;
If it does not exist, store this URL in the queue to be crawled.
2. The URL deduplication method for a distributed web crawler according to claim 1, characterized in that, in step S02, the encrypted URLs are stored in the form of a tree.
3. The URL deduplication method for a distributed web crawler according to claim 1, characterized in that, in step S03, after the ciphertext is split into a 16-element array a, the value of a[0] is compared with the values of the nodes that the root node points to; if a match exists, the next node r[1] that the matched node points to is compared with a[1], and so on in turn until the comparison of a[15] is finished.
4. The URL deduplication method for a distributed web crawler according to claim 1, characterized in that, in step S07, each node of the generalized list stores one letter; when each letter of the URL is traversed, if the generalized list node corresponding to a letter does not exist, the root node of the corresponding layer and the node corresponding to that letter are created first, after which the traversal resumes.
5. A URL deduplication system for a distributed web crawler according to any one of claims 1-4, comprising a processor and a memory, characterized in that:
The processor is electrically connected in turn to a search module, an encryption module, a conversion module, a decryption module, an insertion module, a deduplication module and the memory;
The search module reads the addresses of the web pages to be processed one at a time and passes them to the encryption module;
The encryption module applies MD5 compression encryption to each acquired URL;
The conversion module converts the ciphertext array generated by encryption into a corresponding path according to disk path symbols;
The decryption module decrypts the preliminarily deduplicated URLs;
The insertion module inserts the decrypted URLs into the generalized list;
The deduplication module performs the secondary deduplication by traversing each letter of a URL, starting from the root node of the generalized list.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201811392810.0A CN109657118A (en) | 2018-11-21 | 2018-11-21 | A kind of the URL De-weight method and its system of distributed network crawler |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201811392810.0A CN109657118A (en) | 2018-11-21 | 2018-11-21 | A kind of the URL De-weight method and its system of distributed network crawler |
Publications (1)
Publication Number | Publication Date |
---|---|
CN109657118A true CN109657118A (en) | 2019-04-19 |
Family
ID=66111452
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201811392810.0A Pending CN109657118A (en) | 2018-11-21 | 2018-11-21 | A kind of the URL De-weight method and its system of distributed network crawler |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN109657118A (en) |
Cited By (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN111324797A (en) * | 2020-02-20 | 2020-06-23 | 民生科技有限责任公司 | Method and device for acquiring data accurately at high speed |
CN112287201A (en) * | 2020-12-31 | 2021-01-29 | 北京精准沟通传媒科技股份有限公司 | Method, device, medium and electronic equipment for removing duplicate of crawler request |
Citations (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
WO2012101158A1 (en) * | 2011-01-25 | 2012-08-02 | Openwave Systems Inc. | A method and system of determining whether a requested content element is in a cache |
CN103984753A (en) * | 2014-05-28 | 2014-08-13 | 北京京东尚科信息技术有限公司 | Method and device for extracting web crawler reduplication-removing characteristic value |
CN107844527A (en) * | 2017-10-13 | 2018-03-27 | 平安科技(深圳)有限公司 | Web page address De-weight method, electronic equipment and computer-readable recording medium |
CN108121706A (en) * | 2016-11-28 | 2018-06-05 | 央视国际网络无锡有限公司 | A kind of optimization method of distributed reptile |
-
2018
- 2018-11-21 CN CN201811392810.0A patent/CN109657118A/en active Pending
Patent Citations (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
WO2012101158A1 (en) * | 2011-01-25 | 2012-08-02 | Openwave Systems Inc. | A method and system of determining whether a requested content element is in a cache |
CN103984753A (en) * | 2014-05-28 | 2014-08-13 | 北京京东尚科信息技术有限公司 | Method and device for extracting web crawler reduplication-removing characteristic value |
CN108121706A (en) * | 2016-11-28 | 2018-06-05 | 央视国际网络无锡有限公司 | A kind of optimization method of distributed reptile |
CN107844527A (en) * | 2017-10-13 | 2018-03-27 | 平安科技(深圳)有限公司 | Web page address De-weight method, electronic equipment and computer-readable recording medium |
Cited By (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN111324797A (en) * | 2020-02-20 | 2020-06-23 | 民生科技有限责任公司 | Method and device for acquiring data accurately at high speed |
CN111324797B (en) * | 2020-02-20 | 2023-08-11 | 民生科技有限责任公司 | Method and device for precisely acquiring data at high speed |
CN112287201A (en) * | 2020-12-31 | 2021-01-29 | 北京精准沟通传媒科技股份有限公司 | Method, device, medium and electronic equipment for removing duplicate of crawler request |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN103548003B (en) | Method and system for improving the client-side fingerprint cache of deduplication system backup performance | |
CN102307206B (en) | Caching system and caching method for rapidly accessing virtual machine images based on cloud storage | |
CN104503708B (en) | The method and device of data hash storage | |
CN103678337B (en) | Data clearing method, apparatus and system | |
CN102142032B (en) | Method and system for reading and writing data of distributed file system | |
CN107045422A (en) | Distributed storage method and equipment | |
CN102662992A (en) | Method and device for storing and accessing massive small files | |
CN103186554A (en) | Distributed data mirroring method and data storage node | |
CN105069111A (en) | Similarity based data-block-grade data duplication removal method for cloud storage | |
CN104809182A (en) | Method for web crawler URL (uniform resource locator) deduplicating based on DSBF (dynamic splitting Bloom Filter) | |
CN110321325A (en) | File inode lookup method, terminal, server, system and storage medium | |
CN103077208B (en) | URL(uniform resource locator) matched processing method and device | |
CN110399348A (en) | File deletes method, apparatus, system and computer readable storage medium again | |
CN101158954A (en) | Method for recognizing repeat data in computer storage | |
CN106407224A (en) | Method and device for file compaction in KV (Key-Value)-Store system | |
CN104184812A (en) | Multi-point data transmission method based on private cloud | |
CN109657118A (en) | A kind of the URL De-weight method and its system of distributed network crawler | |
CN103823807A (en) | Data de-duplication method, device and system | |
CN104965835B (en) | A kind of file read/write method and device of distributed file system | |
CN107798106A (en) | A kind of URL De-weight methods in distributed reptile system | |
CN113900810A (en) | Distributed graph processing method, system and storage medium | |
CN104933054A (en) | Uniform resource locator (URL) storage method and device of cache resource file, and cache server | |
CN106776795A (en) | Method for writing data and device based on Hbase databases | |
CN109544344B (en) | Block chain transaction processing method and equipment based on DAG | |
CN103049561B (en) | A kind of data compression method, storage engines and storage system |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
RJ01 | Rejection of invention patent application after publication | ||
Application publication date: 20190419 |