CN109657118A - URL deduplication method and system for a distributed web crawler - Google Patents

URL deduplication method and system for a distributed web crawler

Info

Publication number
CN109657118A
Authority
CN
China
Prior art keywords
url
letter
module
node
duplicate removal
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201811392810.0A
Other languages
Chinese (zh)
Inventor
胡翔
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Anhui Cloud Finance Information Technology Co Ltd
Original Assignee
Anhui Cloud Finance Information Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Anhui Cloud Finance Information Technology Co Ltd
Priority to CN201811392810.0A
Publication of CN109657118A
Legal status: Pending

Classifications

    • H: ELECTRICITY
    • H04: ELECTRIC COMMUNICATION TECHNIQUE
    • H04L: TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L9/00: Cryptographic mechanisms or cryptographic arrangements for secret or secure communications; Network security protocols
    • H04L9/06: Cryptographic mechanisms or cryptographic arrangements for secret or secure communications; Network security protocols, the encryption apparatus using shift registers or memories for block-wise or stream coding, e.g. DES systems or RC4; Hash functions; Pseudorandom sequence generators
    • H04L9/0643: Hash functions, e.g. MD5, SHA, HMAC or f9 MAC

Landscapes

  • Engineering & Computer Science (AREA)
  • Power Engineering (AREA)
  • Computer Security & Cryptography (AREA)
  • Computer Networks & Wireless Communication (AREA)
  • Signal Processing (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention discloses a URL deduplication method and system for a distributed web crawler, relating to the field of data transmission. The method comprises the following steps: step S01: obtain the URL of the web page to be crawled; step S02: apply MD5 compression to the URL; step S03: split the generated ciphertext into a 16-element array; step S04: convert the ciphertext array into a corresponding path using a disk-symbol lookup; step S05: determine whether a corresponding URL exists in linkurl; step S06: decrypt the URLs in the resource library and dynamically insert them into an improved generalized list; step S07: traverse each letter of the URL to be deduplicated; step S08: determine in turn whether the node corresponding to each letter exists. The invention performs a preliminary deduplication of URLs by combining the MD5 encryption algorithm with a tree, then traverses each letter of the preliminarily deduplicated URLs in turn and performs a secondary deduplication according to the nodes corresponding to the letters. This improves the accuracy and efficiency of the data crawled by the web crawler and reduces the resource space occupied.

Description

URL deduplication method and system for a distributed web crawler
Technical field
The invention belongs to the field of data transmission, and in particular relates to a URL deduplication method and system for a distributed web crawler.
Background art
At present, the URL deduplication schemes commonly used by web crawlers are database-based deduplication and deduplication based on an in-memory linked list. These schemes work well when the number of stored URLs is small. However, the number of URLs that existing distributed crawlers must store is usually very large, and URL deduplication has to keep running efficiently over long periods. With the common schemes above, efficiency can drop sharply after the crawler has run for a long time, and there is a risk that the task stalls altogether. The URL deduplication methods of the prior art are therefore not reasonably designed and need improvement.
Summary of the invention
The purpose of the invention is to provide a URL deduplication method and system for a distributed web crawler. URLs are first deduplicated by combining the MD5 encryption algorithm with a tree; each letter of the preliminarily deduplicated URLs is then traversed in turn and a secondary deduplication is performed according to the nodes corresponding to the letters. This addresses the insufficient accuracy and efficiency, and the high resource usage, of data crawling by existing web crawlers.
To solve the above technical problem, the invention is realized by the following technical solution:
The invention is a URL deduplication method for a distributed web crawler, comprising the following steps:
Step S01: obtain the URL of the web page to be crawled;
Step S02: apply 16-digit MD5 compression to the obtained URL;
Step S03: split the generated ciphertext into a 16-element array;
Step S04: convert the ciphertext array into a corresponding path using a disk-symbol lookup;
Step S05: determine whether a corresponding URL exists in linkurl;
If it does not exist, store the URL in the resource library;
If it exists, delete the URL;
Step S06: decrypt the URLs in the resource library and dynamically insert them into the improved generalized list;
Step S07: starting from the root node of the dynamic generalized list, traverse each letter of the URL to be deduplicated;
Step S08: determine in turn whether the node corresponding to each letter exists;
If it exists, remove this URL as a duplicate;
If it does not exist, store this URL in the queue to be crawled.
Preferably, in step S02, the URL is stored in the form of a tree after encryption is complete.
Preferably, in step S03, after the ciphertext is split into the 16-element array a, the value of a[0] is compared with the values of the nodes pointed to by the root node; if a match exists, the next node r[1] pointed to by the matched node is compared with a[1], and the comparison continues in turn until a[15] has been compared.
Preferably, in step S07, each node of the generalized list stores one letter. When the letters of a URL are traversed, if the generalized-list node corresponding to a letter does not exist, the root node of the corresponding layer and the node for that letter must first be created, after which the traversal resumes.
The invention is also a URL deduplication system for a distributed web crawler, comprising a processor and a memory. The processor is electrically connected in turn to a search module, an encryption module, a conversion module, a decryption module, an insertion module, a deduplication module and the memory. The search module reads the addresses of the web pages to be processed one at a time and passes them to the encryption module. The encryption module applies MD5 compression and encryption to the obtained URL. The conversion module converts the ciphertext array generated by encryption into a corresponding path according to the disk symbol. The decryption module decrypts the preliminarily deduplicated URLs. The insertion module inserts the decrypted URLs into the generalized list. The deduplication module traverses each letter of a URL from the root node of the generalized list to perform the secondary deduplication.
The invention has the following advantages:
The invention performs a preliminary deduplication of URLs by combining the MD5 encryption algorithm with a tree, then traverses each letter of the preliminarily deduplicated URLs in turn and performs a secondary deduplication according to the nodes corresponding to the letters. This improves the accuracy and efficiency of the data crawled by the web crawler and reduces the resource space occupied.
Of course, a product implementing the invention does not necessarily need to achieve all of the above advantages at the same time.
Brief description of the drawings
To explain the technical solutions of the embodiments of the invention more clearly, the drawings needed for describing the embodiments are briefly introduced below. The drawings described below are only some embodiments of the invention; a person of ordinary skill in the art can obtain other drawings from them without creative effort.
Fig. 1 is a block diagram of the URL deduplication method for a distributed web crawler according to the invention;
Fig. 2 is a structural schematic diagram of the URL deduplication system for a distributed web crawler according to the invention.
Detailed description of the embodiments
The technical solutions in the embodiments of the invention are described clearly and completely below with reference to the drawings. The described embodiments are only some of the embodiments of the invention, not all of them. All other embodiments obtained by a person of ordinary skill in the art from the embodiments of the invention without creative effort fall within the protection scope of the invention.
Referring to Fig. 1, the invention is a URL deduplication method for a distributed web crawler, comprising the following steps:
Step S01: obtain the URL of the web page to be crawled;
Step S02: apply 16-digit MD5 compression to the obtained URL;
Step S03: split the generated ciphertext into a 16-element array;
Step S04: convert the ciphertext array into a corresponding path using a disk-symbol lookup;
Step S05: determine whether a corresponding URL exists in linkurl;
If it does not exist, the URL is new, so store it in the resource library;
If it exists, linkurl already holds a corresponding URL, so delete the URL;
Step S06: decrypt the URLs in the resource library and dynamically insert them into the improved generalized list;
Step S07: starting from the root node of the dynamic generalized list, traverse each letter of the URL to be deduplicated;
Step S08: determine in turn whether the node corresponding to each letter exists;
If the traversal reaches the last letter of the URL and the corresponding nodes still exist, remove this URL as a duplicate;
If a node does not exist, store this URL in the queue to be crawled.
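Steps S02 to S04 can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the patent's implementation: it assumes that "16-digit MD5" means the middle 16 hexadecimal characters of the 32-character MD5 digest (a common convention that the text does not confirm), and that the "path" is simply the 16 array elements joined like directory names. The function names `md5_16`, `split_digest` and `digest_to_path` are hypothetical.

```python
import hashlib
from typing import List


def md5_16(url: str) -> str:
    """Step S02 (assumed form): middle 16 hex characters of the MD5 digest."""
    full = hashlib.md5(url.encode("utf-8")).hexdigest()  # 32 hex characters
    return full[8:24]


def split_digest(digest: str) -> List[str]:
    """Step S03: split the digest into the array a[0]..a[15], one character each."""
    return list(digest)


def digest_to_path(a: List[str]) -> str:
    """Step S04 (assumed form): one path segment per array element."""
    return "/".join(a)


a = split_digest(md5_16("http://example.com/index.html"))
path = digest_to_path(a)
```

The same URL always yields the same path, which is what lets the later lookup in linkurl detect a repeat.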
In step S02, the URL is stored in the form of a tree after encryption is complete.
In step S03, after the ciphertext is split into the 16-element array a, the value of a[0] is compared with the values of the nodes pointed to by the root node. If a match exists, the next node r[1] pointed to by the matched node is compared with a[1], and the comparison continues in turn until a[15] has been compared. If no match is found, a new node is created whose value is the character currently being compared, and the node that follows it takes the value of the next character.
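The level-by-level comparison of a[0] through a[15] behaves like a lookup in a 16-level prefix tree. The sketch below is one possible reading, assuming nested dictionaries as the tree and the same assumed 16-character digest (middle 16 hex characters of MD5). The names `DigestTree`, `contains_or_add` and `preliminary_dedup` are hypothetical, and the list `resource_library` stands in for the resource library of step S05.

```python
import hashlib
from typing import Dict, Iterable, List


class DigestTree:
    """16-level tree; each level holds one character of the digest array a."""

    def __init__(self) -> None:
        self.root: Dict[str, dict] = {}

    def contains_or_add(self, a: List[str]) -> bool:
        """Walk a[0]..a[15]; return True only if every node already existed."""
        node = self.root
        existed = True
        for ch in a:
            if ch not in node:
                node[ch] = {}   # no match: create a node for this character
                existed = False
            node = node[ch]     # descend to the next level
        return existed


def preliminary_dedup(urls: Iterable[str]) -> List[str]:
    tree = DigestTree()                  # plays the role of linkurl
    resource_library: List[str] = []
    for url in urls:
        # Assumed 16-digit form of the MD5 digest.
        digest = hashlib.md5(url.encode("utf-8")).hexdigest()[8:24]
        if not tree.contains_or_add(list(digest)):
            resource_library.append(url)  # new URL: keep it
        # otherwise the URL is dropped as a duplicate
    return resource_library


kept = preliminary_dedup(
    ["http://a.example/1", "http://a.example/2", "http://a.example/1"]
)
```

Because MD5 is a one-way hash, the sketch keeps the original URL in the resource library instead of recovering it from the digest; how the source performs its decryption step is not specified.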
In step S07, each node of the generalized list stores one letter. When the letters of a URL are traversed, if the generalized-list node corresponding to a letter does not exist, the root node of the corresponding layer and the node for that letter must first be created, after which the traversal resumes. If the node corresponding to the letter exists, the head pointer of that node is followed into the next layer and the traversal continues.
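The letter-by-letter traversal of steps S07 and S08 can be sketched as a character tree. This is an illustration only: nested dictionaries stand in for the generalized list, and the names `LetterTree`, `contains_or_add` and `secondary_dedup` are hypothetical. The end-of-URL marker `END` is an added assumption that the text does not mention; without it, a URL that is merely a prefix of a stored URL would find every node present and be treated as a duplicate.

```python
from typing import Dict, Iterable, List

END = "\0"  # assumed end-of-URL marker; not part of the source description


class LetterTree:
    """One node per letter; children are the letters that may follow it."""

    def __init__(self) -> None:
        self.root: Dict[str, dict] = {}

    def contains_or_add(self, url: str) -> bool:
        """Traverse the URL letter by letter, creating any missing nodes."""
        node = self.root
        for letter in url:
            # setdefault returns the existing child or creates an empty one
            node = node.setdefault(letter, {})
        is_duplicate = END in node  # a complete URL ended here before
        node.setdefault(END, {})
        return is_duplicate


def secondary_dedup(urls: Iterable[str]) -> List[str]:
    tree = LetterTree()
    crawl_queue: List[str] = []
    for url in urls:
        if not tree.contains_or_add(url):
            crawl_queue.append(url)  # not seen before: queue it for crawling
    return crawl_queue


queue = secondary_dedup(
    ["http://a.example/ab", "http://a.example/a", "http://a.example/ab"]
)
```

In this example the second URL is a prefix of the first, yet it is kept, because no complete URL ended at that node earlier; only the third URL is dropped.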
Referring to Fig. 2, the invention is a URL deduplication system for a distributed web crawler, comprising a processor and a memory. The processor is electrically connected in turn to a search module, an encryption module, a conversion module, a decryption module, an insertion module, a deduplication module and the memory. The search module reads the addresses of the web pages to be processed one at a time and passes them to the encryption module. The encryption module applies MD5 compression and encryption to the obtained URL. The conversion module converts the ciphertext array generated by encryption into a corresponding path according to the disk symbol. The decryption module decrypts the preliminarily deduplicated URLs. The insertion module inserts the decrypted URLs into the generalized list. The deduplication module traverses each letter of a URL from the root node of the generalized list to perform the secondary deduplication.
It is worth noting that the units in the above system embodiment are divided only according to functional logic. The division is not limited to the one described, as long as the corresponding functions can be realized. In addition, the specific names of the functional units serve only to distinguish them from one another and do not limit the protection scope of the invention.
In addition, a person of ordinary skill in the art will understand that all or some of the steps of the methods in the above embodiments can be carried out by a program that instructs the relevant hardware, and that the program can be stored in a computer-readable storage medium.
The preferred embodiments disclosed above are intended only to help explain the invention. They do not describe every detail exhaustively, and they do not limit the invention to the specific embodiments described. Many modifications and variations are evidently possible in light of this specification. The embodiments were chosen and described in order to better explain the principles and practical application of the invention, so that a person skilled in the art can understand and use it well. The invention is limited only by the claims and their full scope and equivalents.

Claims (5)

1. A URL deduplication method for a distributed web crawler, characterized by comprising the following steps:
Step S01: obtain the URL of the web page to be crawled;
Step S02: apply 16-digit MD5 compression to the obtained URL;
Step S03: split the generated ciphertext into a 16-element array;
Step S04: convert the ciphertext array into a corresponding path using a disk-symbol lookup;
Step S05: determine whether a corresponding URL exists in linkurl;
If it does not exist, store the URL in the resource library;
If it exists, delete the URL;
Step S06: decrypt the URLs in the resource library and dynamically insert them into the improved generalized list;
Step S07: starting from the root node of the dynamic generalized list, traverse each letter of the URL to be deduplicated;
Step S08: determine in turn whether the node corresponding to each letter exists;
If it exists, remove this URL as a duplicate;
If it does not exist, store this URL in the queue to be crawled.
2. The URL deduplication method for a distributed web crawler according to claim 1, characterized in that, in step S02, the URL is stored in the form of a tree after encryption is complete.
3. The URL deduplication method for a distributed web crawler according to claim 1, characterized in that, in step S03, after the ciphertext is split into the 16-element array a, the value of a[0] is compared with the values of the nodes pointed to by the root node; if a match exists, the next node r[1] pointed to by the matched node is compared with a[1], and the comparison continues in turn until a[15] has been compared.
4. The URL deduplication method for a distributed web crawler according to claim 1, characterized in that, in step S07, each node of the generalized list stores one letter; when the letters of a URL are traversed, if the generalized-list node corresponding to a letter does not exist, the root node of the corresponding layer and the node for that letter must first be created, after which the traversal resumes.
5. A URL deduplication system for a distributed web crawler according to any one of claims 1-4, comprising a processor and a memory, characterized in that:
The processor is electrically connected in turn to a search module, an encryption module, a conversion module, a decryption module, an insertion module, a deduplication module and the memory;
The search module is used to read the addresses of the web pages to be processed one at a time and pass them to the encryption module;
The encryption module is used to apply MD5 compression and encryption to the obtained URL;
The conversion module is used to convert the ciphertext array generated by encryption into a corresponding path according to the disk symbol;
The decryption module is used to decrypt the preliminarily deduplicated URLs;
The insertion module is used to insert the decrypted URLs into the generalized list;
The deduplication module is used to traverse each letter of a URL from the root node of the generalized list to perform the secondary deduplication.
Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201811392810.0A CN109657118A (en) 2018-11-21 2018-11-21 URL deduplication method and system for a distributed web crawler

Publications (1)

Publication Number Publication Date
CN109657118A true CN109657118A (en) 2019-04-19

Family

ID=66111452

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201811392810.0A Pending CN109657118A (en) 2018-11-21 2018-11-21 URL deduplication method and system for a distributed web crawler

Country Status (1)

Country Link
CN (1) CN109657118A (en)

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2012101158A1 (en) * 2011-01-25 2012-08-02 Openwave Systems Inc. A method and system of determining whether a requested content element is in a cache
CN103984753A (en) * 2014-05-28 2014-08-13 北京京东尚科信息技术有限公司 Method and device for extracting web crawler reduplication-removing characteristic value
CN107844527A (en) * 2017-10-13 2018-03-27 平安科技(深圳)有限公司 Web page address De-weight method, electronic equipment and computer-readable recording medium
CN108121706A (en) * 2016-11-28 2018-06-05 央视国际网络无锡有限公司 A kind of optimization method of distributed reptile

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111324797A (en) * 2020-02-20 2020-06-23 民生科技有限责任公司 Method and device for acquiring data accurately at high speed
CN111324797B (en) * 2020-02-20 2023-08-11 民生科技有限责任公司 Method and device for precisely acquiring data at high speed
CN112287201A (en) * 2020-12-31 2021-01-29 北京精准沟通传媒科技股份有限公司 Method, device, medium and electronic equipment for removing duplicate of crawler request

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication

Application publication date: 20190419