CN109657118A

CN109657118A - URL deduplication method and system for a distributed web crawler

Info

Publication number: CN109657118A
Application number: CN201811392810.0A
Authority: CN
Inventors: 胡翔
Original assignee: Anhui Cloud Finance Information Technology Co Ltd
Current assignee: Anhui Cloud Finance Information Technology Co Ltd
Priority date: 2018-11-21
Filing date: 2018-11-21
Publication date: 2019-04-19

Abstract

The invention discloses a URL deduplication method and system for a distributed web crawler, and relates to the field of data transmission. The method comprises the following steps: Step S01: obtain the URL of a web page to be crawled; Step S02: apply MD5 compression to the URL; Step S03: split the generated ciphertext into a 16-element array; Step S04: convert the ciphertext array into a corresponding path by disk-symbol lookup; Step S05: determine whether a corresponding URL exists in linkurl; Step S06: decrypt the URLs in the resource library and dynamically insert them into an improved generalized list; Step S07: traverse each letter of the URL to be deduplicated; Step S08: determine in turn whether the node corresponding to each letter exists. The invention performs preliminary deduplication of URLs by combining the MD5 encryption algorithm with a tree structure, and then performs secondary deduplication by traversing each letter of the preliminarily deduplicated URL and checking the node corresponding to each letter. This improves the accuracy and efficiency with which the web crawler crawls data and reduces the storage space it occupies.

Description

URL deduplication method and system for a distributed web crawler

Technical field

The invention belongs to the field of data transmission and relates in particular to a URL deduplication method and system for a distributed web crawler.

Background art

At present, the URL deduplication schemes commonly used by web crawlers are database-based deduplication and deduplication based on an in-memory linked list. These schemes work well when the number of stored URLs is small. A distributed crawler, however, usually faces a very large number of stored URLs and needs URL deduplication to keep running efficiently over time. With the common schemes above, efficiency drops sharply after the crawler has run for a long time, or the task risks stalling altogether. The URL deduplication methods of the prior art are not well designed and need improvement.

Summary of the invention

The purpose of the invention is to provide a URL deduplication method and system for a distributed web crawler. Preliminary deduplication of URLs is performed by combining the MD5 encryption algorithm with a tree structure; secondary deduplication is then performed by traversing each letter of the preliminarily deduplicated URL and checking the node corresponding to each letter. This solves the problems that existing web crawlers crawl data with insufficient accuracy and efficiency and occupy too many resources.

In order to solve the above technical problems, the present invention is achieved by the following technical solutions:

The invention is a URL deduplication method for a distributed web crawler, comprising the following steps:

Step S01: obtain the URL of the web page to be crawled;

Step S02: apply 16-character MD5 compression to the obtained URL;

Step S03: split the generated ciphertext into a 16-element array;

Step S04: convert the ciphertext array into a corresponding path by disk-symbol lookup;

Step S05: determine whether a corresponding URL exists in linkurl;

If it does not exist, store the URL in the resource library;

If it exists, delete the URL;

Step S06: decrypt the URLs in the resource library and dynamically insert them into the improved generalized list;

Step S07: starting from the root node of the dynamic generalized list, traverse each letter of the URL to be deduplicated;

Step S08: determine in turn whether the node corresponding to each letter exists;

If it exists, remove the URL as a duplicate;

If it does not exist, store the URL in the queue to be crawled.
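Steps S02 to S04 can be sketched as follows. This is a minimal illustration under stated assumptions, not the patented implementation: the function names are hypothetical, "16-character MD5" is assumed to mean the middle 16 hexadecimal characters of the 32-character digest, and the path layout (one directory level per array element under a base directory named `linkurl`) is one possible reading of the disk-symbol lookup.

```python
import hashlib
from pathlib import PurePosixPath


def md5_16(url: str) -> str:
    """Step S02 (assumed form): middle 16 hex characters of the MD5 digest."""
    return hashlib.md5(url.encode("utf-8")).hexdigest()[8:24]


def to_array(digest16: str) -> list[str]:
    """Step S03: split the 16-character ciphertext into a 16-element array."""
    return list(digest16)


def to_path(a: list[str], base: str = "linkurl") -> PurePosixPath:
    """Step S04 (assumed layout): one directory level per array element."""
    return PurePosixPath(base, *a)


a = to_array(md5_16("http://example.com/page"))
path = to_path(a)
```

With this layout, the check in step S05 reduces to testing whether the path already exists.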

Preferably, in step S02, the encrypted URL is stored in the form of a tree once encryption is complete.

Preferably, in step S03, after the ciphertext is split into a 16-element array a, the value of a[0] is compared with the values of the nodes that the root node points to; if a match exists, the next nodes pointed to by the matching node r[1] are compared with a[1], and so on until the comparison of a[15] is complete.
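The level-by-level comparison of a[0] through a[15] can be sketched with a tree of nested dictionaries. The names `md5_array` and `seen_or_add` are hypothetical, the 16-character form of the digest is an assumption, and creating a node when a character is not found follows the detailed description of step S03.

```python
import hashlib


def md5_array(url: str) -> list[str]:
    # Assumption: the 16-character form is the middle of the 32-character hex digest.
    return list(hashlib.md5(url.encode("utf-8")).hexdigest()[8:24])


def seen_or_add(root: dict, a: list[str]) -> bool:
    """Compare a[0]..a[15] against the tree one level at a time.

    Returns True if every level already had a matching node (duplicate).
    Missing nodes are created on the way, so the digest is recorded.
    """
    node = root
    existed = True
    for ch in a:
        if ch not in node:
            node[ch] = {}      # no match at this level: create the node
            existed = False
        node = node[ch]        # descend to the next level
    return existed


root: dict = {}
first = seen_or_add(root, md5_array("http://example.com/a"))
second = seen_or_add(root, md5_array("http://example.com/a"))
third = seen_or_add(root, md5_array("http://example.com/b"))
```

Each lookup touches at most 16 levels regardless of how many URLs are stored, which is the property the method relies on for large URL sets.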

Preferably, in step S07, each node of the generalized list stores one letter. While traversing the letters of the URL, if the generalized-list node for a letter does not exist, the root node of the corresponding layer and the node for that letter are created first, after which the traversal resumes.

The invention is also a URL deduplication system for a distributed web crawler, comprising a processor and a memory. The processor is electrically connected, in sequence, to a search module, an encryption module, a conversion module, a decryption module, an insertion module, a deduplication module, and the memory. The search module reads the address of one web page to be processed at a time and passes it to the encryption module. The encryption module applies MD5 compression encryption to the obtained URL. The conversion module converts the ciphertext array generated by encryption into a corresponding path according to the disk symbol. The decryption module decrypts the preliminarily deduplicated URLs. The insertion module inserts the decrypted URLs into the generalized list. The deduplication module traverses each letter of the URL from the root node of the generalized list to perform secondary deduplication.

The invention has the following advantages:

The invention performs preliminary deduplication of URLs by combining the MD5 encryption algorithm with a tree structure, and then performs secondary deduplication by traversing each letter of the preliminarily deduplicated URL and checking the node corresponding to each letter. This improves the accuracy and efficiency with which the web crawler crawls data and reduces the storage space it occupies.

Of course, a product implementing the invention need not achieve all of the above advantages at the same time.

Brief description of the drawings

To illustrate the technical solutions of the embodiments of the invention more clearly, the drawings needed for describing the embodiments are briefly introduced below. The drawings described below show only some embodiments of the invention; a person of ordinary skill in the art can derive other drawings from them without creative effort.

Fig. 1 is a block diagram of the URL deduplication method for a distributed web crawler according to the invention;

Fig. 2 is a schematic structural diagram of the URL deduplication system for a distributed web crawler according to the invention.

Detailed description of the embodiments

The technical solutions in the embodiments of the invention are described clearly and completely below with reference to the drawings. The described embodiments are only some, not all, of the embodiments of the invention. All other embodiments obtained by a person of ordinary skill in the art on the basis of these embodiments without creative effort fall within the scope of protection of the invention.

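Referring to Fig. 1, the invention is a URL deduplication method for a distributed web crawler, comprising the following steps:

Step S01: obtain the URL of the web page to be crawled;

Step S02: apply 16-character MD5 compression to the obtained URL;

Step S03: split the generated ciphertext into a 16-element array;

Step S04: convert the ciphertext array into a corresponding path by disk-symbol lookup;

Step S05: determine whether a corresponding URL exists in linkurl;

If it does not exist, the URL is new, and it is stored in the resource library;

If it exists, linkurl already contains the corresponding URL, and the URL is deleted;

Step S06: decrypt the URLs in the resource library and dynamically insert them into the improved generalized list;

Step S07: starting from the root node of the dynamic generalized list, traverse each letter of the URL to be deduplicated;

Step S08: determine in turn whether the node corresponding to each letter exists;

If the traversal reaches the last letter of the URL and a corresponding node has existed at every step, the URL is removed as a duplicate;

If it does not exist, the URL is stored in the queue to be crawled.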

In step S02, the encrypted URL is stored in the form of a tree once encryption is complete.

In step S03, after the ciphertext is split into a 16-element array a, the value of a[0] is compared with the values of the nodes that the root node points to. If a match exists, the next nodes pointed to by the matching node r[1] are compared with a[1], and so on until the comparison of a[15] is complete. If no match is found, a new node is created whose value is the character currently being compared, and its successor is the node for the next character.

In step S07, each node of the generalized list stores one letter. While traversing the letters of the URL, if the generalized-list node for a letter does not exist, the root node of the corresponding layer and the node for that letter are created first, after which the traversal resumes. If the node for the letter exists, the traversal follows that node's head pointer into the next layer and continues.
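The traversal in steps S07 and S08 can be sketched with a linked node that has a head pointer (first node of the next layer) and a tail pointer (next node in the same layer), which is one common way to represent a generalized list. All names are hypothetical. The end marker is an assumption added here so that a URL that is a strict prefix of a stored URL is not treated as a duplicate; the source does not describe how that case is handled.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Node:
    letter: str
    head: Optional["Node"] = None  # first node of the next layer
    tail: Optional["Node"] = None  # next node in the same layer


def find_or_create(parent: Node, letter: str) -> tuple[Node, bool]:
    """Return the node for `letter` in the layer below `parent`, and whether it was created."""
    prev, cur = None, parent.head
    while cur is not None:
        if cur.letter == letter:
            return cur, False
        prev, cur = cur, cur.tail
    new = Node(letter)
    if prev is None:
        parent.head = new      # layer was empty: new node starts it
    else:
        prev.tail = new        # append to the end of the layer
    return new, True


END = "\0"  # assumed end marker, not part of the source


def is_duplicate(root: Node, url: str) -> bool:
    """True if every letter (and the end marker) already had a node."""
    node, created_any = root, False
    for letter in url + END:
        node, created = find_or_create(node, letter)
        created_any = created_any or created
    return not created_any


root = Node("")
r1 = is_duplicate(root, "http://a.com/x")
r2 = is_duplicate(root, "http://a.com/x")
r3 = is_duplicate(root, "http://a.com")
```

A URL for which `is_duplicate` returns False would go to the queue to be crawled; its nodes have already been inserted by the same traversal.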

Referring to Fig. 2, the invention is also a URL deduplication system for a distributed web crawler, comprising a processor and a memory. The processor is electrically connected, in sequence, to a search module, an encryption module, a conversion module, a decryption module, an insertion module, a deduplication module, and the memory. The search module reads the address of one web page to be processed at a time and passes it to the encryption module. The encryption module applies MD5 compression encryption to the obtained URL. The conversion module converts the ciphertext array generated by encryption into a corresponding path according to the disk symbol. The decryption module decrypts the preliminarily deduplicated URLs. The insertion module inserts the decrypted URLs into the generalized list. The deduplication module traverses each letter of the URL from the root node of the generalized list to perform secondary deduplication.
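The order in which the modules hand data to one another can be sketched as below. This is only an illustration of the wiring: the class and method names are hypothetical, a dictionary and a set stand in for the tree and the generalized list, and because MD5 is a one-way hash the "decryption" step is modeled as looking up the URL stored under its path rather than inverting the digest.

```python
import hashlib
from dataclasses import dataclass, field


@dataclass
class DedupSystem:
    linkurl: dict = field(default_factory=dict)    # path -> URL (stand-in for linkurl and the resource library)
    seen_urls: set = field(default_factory=set)    # stand-in for the generalized list
    crawl_queue: list = field(default_factory=list)

    def encrypt(self, url: str) -> str:
        # Encryption module: assumed 16-character form of the MD5 digest.
        return hashlib.md5(url.encode("utf-8")).hexdigest()[8:24]

    def convert(self, ciphertext: str) -> str:
        # Conversion module: one path segment per array element (assumed layout).
        return "/".join(ciphertext)

    def decrypt(self, path: str) -> str:
        # Decryption module: lookup, since the digest cannot be inverted.
        return self.linkurl[path]

    def process(self, url: str) -> bool:
        """Run one URL through the modules; True if it reached the crawl queue."""
        path = self.convert(self.encrypt(url))
        if path in self.linkurl:           # preliminary deduplication
            return False
        self.linkurl[path] = url
        plain = self.decrypt(path)
        if plain in self.seen_urls:        # secondary deduplication
            return False
        self.seen_urls.add(plain)          # insertion module
        self.crawl_queue.append(plain)
        return True


system = DedupSystem()
urls = ["http://example.com/a", "http://example.com/b", "http://example.com/a"]
results = [system.process(u) for u in urls]
```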

It is worth noting that the units in the above system embodiment are divided only according to functional logic. The division is not limited to the one described, provided the corresponding functions can be realized. The specific names of the functional units serve only to distinguish them from one another and are not intended to limit the scope of protection of the invention.

In addition, a person of ordinary skill in the art will appreciate that all or part of the steps of the above method embodiments can be carried out by a program instructing the relevant hardware, and that the program can be stored in a computer-readable storage medium.

The preferred embodiments disclosed above are intended only to help explain the invention. They do not describe every detail, nor do they limit the invention to the specific implementations described. Many modifications and variations can be made in light of this specification. The embodiments were chosen and described in detail to better explain the principles and practical application of the invention, so that those skilled in the art can understand and use it well. The invention is limited only by the claims, their full scope, and their equivalents.

Claims

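1. A URL deduplication method for a distributed web crawler, characterized by comprising the following steps:

Step S01: obtain the URL of the web page to be crawled;

Step S02: apply 16-character MD5 compression to the obtained URL;

Step S03: split the generated ciphertext into a 16-element array;

Step S04: convert the ciphertext array into a corresponding path by disk-symbol lookup;

Step S05: determine whether a corresponding URL exists in linkurl;

If it does not exist, store the URL in the resource library;

If it exists, delete the URL;

Step S06: decrypt the URLs in the resource library and dynamically insert them into the improved generalized list;

Step S07: starting from the root node of the dynamic generalized list, traverse each letter of the URL to be deduplicated;

Step S08: determine in turn whether the node corresponding to each letter exists;

If it exists, remove the URL as a duplicate;

If it does not exist, store the URL in the queue to be crawled.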

2. The URL deduplication method for a distributed web crawler according to claim 1, characterized in that, in step S02, the encrypted URL is stored in the form of a tree once encryption is complete.

3. The URL deduplication method for a distributed web crawler according to claim 1, characterized in that, in step S03, after the ciphertext is split into a 16-element array a, the value of a[0] is compared with the values of the nodes that the root node points to; if a match exists, the next nodes pointed to by the matching node r[1] are compared with a[1], and so on until the comparison of a[15] is complete.

4. The URL deduplication method for a distributed web crawler according to claim 1, characterized in that, in step S07, each node of the generalized list stores one letter; while traversing the letters of the URL, if the generalized-list node for a letter does not exist, the root node of the corresponding layer and the node for that letter are created first, after which the traversal resumes.

5. A URL deduplication system for a distributed web crawler according to any one of claims 1 to 4, comprising a processor and a memory, characterized in that:

The processor is electrically connected, in sequence, to a search module, an encryption module, a conversion module, a decryption module, an insertion module, a deduplication module, and the memory;

The search module reads the address of one web page to be processed at a time and passes it to the encryption module;

The encryption module applies MD5 compression encryption to the obtained URL;

The conversion module converts the ciphertext array generated by encryption into a corresponding path according to the disk symbol;

The decryption module decrypts the preliminarily deduplicated URLs;

The insertion module inserts the decrypted URLs into the generalized list;

The deduplication module traverses each letter of the URL from the root node of the generalized list to perform secondary deduplication.