CN110399546A

CN110399546A - Link deduplication method, device, equipment and storage medium based on web crawlers

Info

Publication number: CN110399546A
Application number: CN201910670803.0A
Authority: CN
Inventors: 雷建云; 王锦群; 郑禄; 毛腾跃; 孙翀; 马尧; 张蕾
Original assignee: South Central University for Nationalities
Current assignee: South Central Minzu University
Priority date: 2019-07-23
Filing date: 2019-07-23
Publication date: 2019-11-01
Anticipated expiration: 2039-07-23
Also published as: CN110399546B

Abstract

The present invention relates to the field of Internet technology and discloses a link deduplication method, device, equipment and storage medium based on web crawlers. The method comprises: upon receiving a data crawling request for an agricultural product to be analyzed, extracting from the data crawling request a first uniform resource locator (URL) link of a platform to be visited; sending an access request to the platform to be visited according to the first URL link; after receiving the response made by the platform to be visited to the access request, crawling the data information in the page corresponding to the first URL link; parsing the data information to obtain the second URL links embedded in the page, and adding the second URL links to a URL queue to be crawled; and performing joint deduplication on the second URL links in the URL queue to be crawled using a counting Bloom filter based on link features in combination with multiple hashing. By optimizing the link deduplication mode, the invention improves the performance of the web crawler, so that the crawler can quickly obtain the information people need, improving the user experience.

Description

Link deduplication method, device, equipment and storage medium based on web crawlers

Technical field

The present invention relates to the field of Internet technology, and in particular to a link deduplication method, device, equipment and storage medium based on web crawlers.

Background art

When crawling web pages, a web crawler inevitably encounters pages that would be downloaded repeatedly. To prevent the loss of efficiency and the waste of server resources caused by repeated crawling, uniform resource locators (Uniform Resource Locator, URL) must be filtered for duplicates. Common link deduplication approaches at present include link-compression deduplication based on message-digest algorithm 5 (MD5), storage deduplication based on hash algorithms, and link deduplication based on Bloom filters.

Link-compression deduplication based on MD5 solves the problem that URLs occupy a large amount of storage space. However, as the number of URLs grows, memory usage still grows with it, and hash collisions, although improbable, reduce the accuracy of duplicate checking, so the performance of the web crawler can be seriously affected.

Storage deduplication based on hash algorithms checks for duplicates quickly and with relatively high accuracy, but it requires a well-designed hash function and a hash table that must be maintained. In addition, memory consumption becomes excessive as the scale of crawled pages grows, which can also seriously affect the performance of the web crawler.

Link deduplication based on Bloom filters solves the space-complexity problem, but it produces a certain rate of false positives and cannot delete elements that have already been inserted. That is, the more elements there are, the higher the false positive rate, which can also seriously affect the performance of the web crawler.
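The relation between element count and false positive rate can be made concrete with the standard approximation for a plain Bloom filter with m bits, k hash functions and n inserted elements, (1 - e^(-kn/m))^k. The following Python sketch is illustrative only; the function name and the parameter values are assumptions and do not come from the invention.

```python
import math


def bloom_false_positive_rate(m_bits, k_hashes, n_items):
    """Standard approximation (1 - e^(-k*n/m))^k for a plain Bloom filter."""
    return (1.0 - math.exp(-k_hashes * n_items / m_bits)) ** k_hashes


# Illustrative parameters (assumed): a fixed filter size with a growing URL count.
for n in (1_000, 10_000, 100_000):
    rate = bloom_false_positive_rate(m_bits=1_000_000, k_hashes=4, n_items=n)
    print(n, rate)
```

With the filter size held fixed, the printed rate rises as n grows, which is the behaviour the paragraph above describes.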

Therefore, a link deduplication approach for web crawlers is urgently needed to improve crawler performance, so that the web crawler can quickly obtain the information people need and thereby improve the user experience.

The above content is provided only to aid understanding of the technical solution of the present invention and does not constitute an admission that it is prior art.

Summary of the invention

The main purpose of the present invention is to provide a link deduplication method, device, equipment and storage medium based on web crawlers, aiming to improve the performance of the web crawler by optimizing the link deduplication mode, so that the web crawler can quickly obtain the information people need, improving the user experience.

To achieve the above object, the present invention provides a link deduplication method based on web crawlers, the method comprising the following steps:

Upon receiving a data crawling request for an agricultural product to be analyzed, extracting from the data crawling request a first uniform resource locator (URL) link of a platform to be visited;

Sending an access request to the platform to be visited according to the first URL link;

After receiving the response made by the platform to be visited to the access request, crawling the data information in the page corresponding to the first URL link;

Parsing the data information to obtain the second URL links embedded in the page, and adding the second URL links to a URL queue to be crawled;

Performing joint deduplication on the second URL links in the URL queue to be crawled using a counting Bloom filter based on link features in combination with multiple hashing.

Preferably, before the step of performing joint deduplication on the second URL links in the URL queue to be crawled using the counting Bloom filter based on link features in combination with multiple hashing, the method further comprises:

Traversing the URL queue to be crawled, performing feature analysis on the currently traversed second URL link, and extracting the protocol type part, the path part and the query part of the current second URL link;

Obtaining the global feature URL link corresponding to the current second URL link according to the protocol type part, the path part and the query part;

Establishing a correspondence between the current second URL link and the global feature URL link, and updating the correspondence into the URL queue to be crawled.

Preferably, the step of performing joint deduplication on the second URL links in the URL queue to be crawled using the counting Bloom filter based on link features in combination with multiple hashing comprises:

Traversing the URL queue to be crawled and obtaining the global feature URL link corresponding to the currently traversed second URL link;

Performing whole-link duplicate checking on the global feature URL link using the counting Bloom filter based on link features, to obtain a duplicate-check identifier corresponding to the global feature URL link;

Performing feature identification on the global feature URL link according to the duplicate-check identifier, to obtain a plurality of feature fragments;

Recombining the plurality of feature fragments according to a preset URL link recombination rule, to obtain N recombined URL link fragments, N being an integer greater than or equal to 1;

Performing multiple-hash duplicate checking on the N recombined URL link fragments, to obtain the duplicate-check result corresponding to the current second URL link;

Retaining or discarding the second URL link in the URL queue to be crawled according to the duplicate-check result.

Preferably, after the step of recombining the plurality of feature fragments according to the preset URL link recombination rule to obtain N recombined URL link fragments, the method further comprises:

Compressing each of the N recombined URL link fragments based on the MD5 algorithm, to obtain the ciphertext string corresponding to each of the N recombined URL link fragments;

Replacing the content of each recombined URL link fragment with its corresponding ciphertext string.

Preferably, the step of performing multiple-hash duplicate checking on the N recombined URL link fragments to obtain the duplicate-check result corresponding to the current second URL link comprises:

Extracting the ciphertext strings corresponding to the N recombined URL link fragments, selecting any one ciphertext string from the N ciphertext strings and performing K hash operations on it to obtain K hash values, K being an integer greater than or equal to 2;

Hashing the K hash values into a pre-constructed bit vector space as reference hash values, and setting an initial count value in the space-variable counter corresponding to each reference hash value;

Performing K hash operations on each of the remaining N-1 ciphertext strings, to obtain the K hash values corresponding to each remaining ciphertext string;

Randomly hashing the K hash values corresponding to each remaining ciphertext string into the bit vector space, adjacent to any one of the reference hash values;

Using head insertion, inserting one preset character before the initial count value of the adjacent reference hash value for each hash value newly hashed into the bit vector space;

Counting the number of preset characters before the initial count value of each reference hash value, and determining the duplicate-check result corresponding to the current second URL link according to the number of preset characters.

Preferably, after the step of performing joint deduplication on the second URL links in the URL queue to be crawled using the counting Bloom filter based on link features in combination with multiple hashing, the method further comprises:

Compressing each second URL link in the deduplicated URL queue to be crawled based on the MD5 algorithm, to obtain the ciphertext string corresponding to each second URL link;

Replacing the content of each second URL link with its corresponding ciphertext string;

Judging whether any already-visited second URL link exists in the deduplicated URL queue to be crawled;

If an already-visited second URL link exists in the URL queue to be crawled, deleting the visited second URL link from the URL queue to be crawled.
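The post-deduplication steps above (MD5 compression followed by removal of visited links) can be sketched in Python as follows. This is a minimal sketch under stated assumptions: the helper names `md5_hex` and `compress_and_prune`, the use of a `deque` for the queue, and the representation of visited links as a set of digests are all hypothetical choices for illustration.

```python
import hashlib
from collections import deque


def md5_hex(text):
    """Compress a link to its 32-character MD5 hex digest."""
    return hashlib.md5(text.encode("utf-8")).hexdigest()


def compress_and_prune(to_crawl, visited_digests):
    """Replace each link by its MD5 digest, then drop digests already visited."""
    digests = (md5_hex(url) for url in to_crawl)
    return deque(d for d in digests if d not in visited_digests)


# Hypothetical queue contents and visited set.
queue = deque(["http://example.com/a", "http://example.com/b"])
visited = {md5_hex("http://example.com/b")}
pruned = compress_and_prune(queue, visited)
```

Here `pruned` holds only the digest of the link that has not yet been visited.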

In addition, to achieve the above object, the present invention also proposes a link deduplication device based on web crawlers, the device comprising:

An extraction module, configured to extract, upon receiving a data crawling request for an agricultural product to be analyzed, a first uniform resource locator (URL) link of a platform to be visited from the data crawling request;

A sending module, configured to send an access request to the platform to be visited according to the first URL link;

A crawling module, configured to crawl the data information in the page corresponding to the first URL link after receiving the response made by the platform to be visited to the access request;

A parsing module, configured to parse the data information to obtain the second URL links embedded in the page, and to add the second URL links to a URL queue to be crawled;

A deduplication module, configured to perform joint deduplication on the second URL links in the URL queue to be crawled using a counting Bloom filter based on link features in combination with multiple hashing.

In addition, to achieve the above object, the present invention also proposes link deduplication equipment based on web crawlers, the equipment comprising: a memory, a processor, and a web-crawler-based link deduplication program stored in the memory and executable on the processor, the web-crawler-based link deduplication program being configured to implement the steps of the link deduplication method based on web crawlers described above.

In addition, to achieve the above object, the present invention also proposes a computer-readable storage medium on which a web-crawler-based link deduplication program is stored, the web-crawler-based link deduplication program implementing the steps of the link deduplication method based on web crawlers described above when executed by a processor.

The link deduplication scheme based on web crawlers provided by the invention performs joint deduplication on the second URL links cached in the URL queue to be crawled using a counting Bloom filter based on link features in combination with multiple hashing. This reduces the false positive rate of the counting Bloom filter as far as possible and significantly improves the performance of the web crawler, so that the web crawler can quickly obtain the information people need, improving the user experience.

Brief description of the drawings

Fig. 1 is a schematic structural diagram of the web-crawler-based link deduplication equipment in the hardware operating environment involved in the embodiments of the present invention;

Fig. 2 is a schematic flowchart of a first embodiment of the link deduplication method based on web crawlers of the present invention;

Fig. 3 is a schematic flowchart of a second embodiment of the link deduplication method based on web crawlers of the present invention;

Fig. 4 is a structural block diagram of a first embodiment of the link deduplication device based on web crawlers of the present invention.

The realization of the object, the functional characteristics and the advantages of the present invention will be further described with reference to the embodiments and the accompanying drawings.

Detailed description of the embodiments

It should be understood that the specific embodiments described herein are only used to explain the present invention and are not intended to limit it.

Referring to Fig. 1, Fig. 1 is a schematic structural diagram of the web-crawler-based link deduplication equipment in the hardware operating environment involved in the embodiments of the present invention.

As shown in Fig. 1, the web-crawler-based link deduplication equipment may include: a processor 1001, such as a central processing unit (Central Processing Unit, CPU), a communication bus 1002, a user interface 1003, a network interface 1004 and a memory 1005. The communication bus 1002 implements connection and communication among these components. The user interface 1003 may include a display (Display) and an input unit such as a keyboard (Keyboard); optionally, the user interface 1003 may also include a standard wired interface and a wireless interface. The network interface 1004 may optionally include a standard wired interface and a wireless interface (such as a Wireless Fidelity (WIreless-FIdelity, WI-FI) interface). The memory 1005 may be a high-speed random access memory (Random Access Memory, RAM) or a stable non-volatile memory (Non-Volatile Memory, NVM), such as a disk memory. The memory 1005 may optionally also be a storage device independent of the aforementioned processor 1001.

Those skilled in the art will understand that the structure shown in Fig. 1 does not limit the web-crawler-based link deduplication equipment, which may include more or fewer components than illustrated, combine certain components, or arrange the components differently.

As shown in Fig. 1, the memory 1005, as a storage medium, may include an operating system, a network communication module, a user interface module and the web-crawler-based link deduplication program.

In the web-crawler-based link deduplication equipment shown in Fig. 1, the network interface 1004 is mainly used for data communication with a network server, and the user interface 1003 is mainly used for data interaction with the user. The processor 1001 and the memory 1005 may be provided in the web-crawler-based link deduplication equipment, which calls, through the processor 1001, the web-crawler-based link deduplication program stored in the memory 1005 and executes the link deduplication method based on web crawlers provided by the embodiments of the present invention.

An embodiment of the present invention provides a link deduplication method based on web crawlers. Referring to Fig. 2, Fig. 2 is a schematic flowchart of a first embodiment of the link deduplication method based on web crawlers of the present invention.

In this embodiment, the link deduplication method based on web crawlers comprises the following steps:

Step S10: upon receiving a data crawling request for an agricultural product to be analyzed, extract from the data crawling request the first uniform resource locator (URL) link of the platform to be visited.

Specifically, the execution subject of this embodiment is any terminal device on which a web crawler system is deployed or installed.

It is worth noting that, in order to maximize the speed of crawling, parsing and other operations on the data corresponding to the agricultural product to be analyzed, the web crawler system in this embodiment is preferably a distributed web crawler system.

It should be understood that, in practical applications, the terminal device may be a client device or a server-side device, which is not limited here.

In addition, in practical applications, the platform to be visited may be an online store that displays the agricultural product to be analyzed.

Correspondingly, the uniform resource locator (Uniform Resource Locator, URL) is the network address needed to access that online store.

It should also be understood that the agricultural product to be analyzed is a general term for the various agricultural products currently on the market. In practical applications it may be a tea product, a fruit or vegetable product, a cereal product and so on, which are not enumerated here and are not limited in any way.

Step S20: send an access request to the platform to be visited according to the first URL link.

Specifically, in practical applications, the web crawler may use the HyperText Transfer Protocol (HTTP), which transmits data over the Transmission Control Protocol/Internet Protocol (TCP/IP), to send the access request to the platform to be visited (in essence, to the server of the platform to be visited).

It should be understood that the above is only one specific implementation of sending an access request to the platform to be visited and does not limit the technical solution of the present invention in any way. In practical applications, those skilled in the art can configure this as needed, which is not limited here.
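As one possible illustration of step S20, the following Python sketch builds and sends an HTTP GET request with the standard library module `urllib.request`. The function names, the `USER_AGENT` value and the timeout are assumptions made for the example; the embodiment does not prescribe any particular HTTP client.

```python
import urllib.request

USER_AGENT = "example-crawler/0.1"  # hypothetical identifier, not from the source


def build_access_request(first_url):
    """Build an HTTP GET request for the first URL link."""
    return urllib.request.Request(
        first_url, headers={"User-Agent": USER_AGENT}, method="GET"
    )


def fetch_page(first_url, timeout=10.0):
    """Send the access request and return the raw bytes of the response body."""
    request = build_access_request(first_url)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return response.read()
```

Calling `fetch_page` requires network access; `build_access_request` alone can be used to inspect the request that would be sent.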

Step S30: after receiving the response made by the platform to be visited to the access request, crawl the data information in the page corresponding to the first URL link.

It should be understood that, in practical applications, if the access request is sent successfully and the platform to be visited successfully verifies the first URL link carried in the access request, the platform returns a success response together with the data information in the page corresponding to the first URL link. The web crawler can then crawl the data information that the platform to be visited returns for that page.

Step S40: parse the data information to obtain the second URL links embedded in the page, and add the second URL links to the URL queue to be crawled.

It should be understood that, in practical applications, the page corresponding to the first URL link may display, besides the data information on the agricultural product to be analyzed, a number of URL links related to that data information. For ease of distinction, these are referred to here as second URL links.

For example, the page corresponding to the first URL link displays the home page of an online store that includes the agricultural product to be analyzed. The home page mainly shows information on four major categories of agricultural products, namely agricultural product A, agricultural product B, agricultural product C and agricultural product D. Each major category corresponds to a second URL link, and the page corresponding to that second URL link mainly shows the subcategories that the category includes.

For example, the page corresponding to the second URL link of agricultural product A mainly shows agricultural products A-1, A-2 and A-3; the page corresponding to the second URL link of agricultural product B mainly shows agricultural products B-1 and B-2; the page corresponding to the second URL link of agricultural product C mainly shows agricultural products C-1, C-2, C-3 and C-4; and the page corresponding to the second URL link of agricultural product D mainly shows agricultural products D-1 and D-2.

It should be understood that the above is only an example and does not limit the technical solution of the present invention in any way. In practical applications, those skilled in the art can configure this as needed, which is not limited here.
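The parsing in step S40 can be sketched in Python with the standard library. This is a minimal sketch that assumes the second URL links appear as `href` attributes of `<a>` tags and that relative links are resolved against the first URL link; the class name `LinkExtractor`, the function `enqueue_second_links` and the sample page are hypothetical.

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkExtractor(HTMLParser):
    """Collect the href of every <a> tag, resolved against the page URL."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        for name, value in attrs:
            if name == "href" and value:
                self.links.append(urljoin(self.base_url, value))


def enqueue_second_links(first_url, html_text, to_crawl):
    """Parse the page and append its embedded links to the queue to be crawled."""
    parser = LinkExtractor(first_url)
    parser.feed(html_text)
    to_crawl.extend(parser.links)
    return parser.links


# Hypothetical page content for a store home page with two category links.
page = '<a href="/category/a">A</a> <a href="/category/b?page=1">B</a>'
queue = deque()
enqueue_second_links("http://example.com/", page, queue)
```

After the call, `queue` contains the two absolute category links in page order.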

In addition, in this embodiment, the second URL links embedded in the page are added to the URL queue to be crawled because, in practical applications, a web crawler crawls a large amount of data, so the number of second URL links parsed out is also large. Crawling and parsing each second URL link takes considerable time, so a large number of second URL links cannot be visited within a short period; the second URL links obtained each time therefore need to be added to the URL queue to be crawled.

In addition, "first" in "first URL link" and "second" in "second URL link" serve only to distinguish the URL link of the platform to be visited from the URL links embedded in the page corresponding to that link; they do not limit the URL links themselves. In practical applications, any "second URL link" can be regarded as a "first URL link" relative to the URL links embedded in its own page.

Step S50: perform joint deduplication on the second URL links in the URL queue to be crawled using the counting Bloom filter based on link features in combination with multiple hashing.

Specifically, the joint deduplication performed on the second URL links in the URL queue to be crawled, using the counting Bloom filter based on link features in combination with multiple hashing, is mainly divided into deduplication of the global feature URL link corresponding to each URL link and deduplication of URL link fragments.

Because the URL link fragments are derived from the global feature URL link, the correspondence between each second URL link and its global feature URL link must be determined first so that the joint deduplication can proceed smoothly.

For ease of understanding, this embodiment provides a specific implementation for determining the correspondence between a second URL link and its global feature URL link, roughly as follows:

(1) Traverse the URL queue to be crawled, perform feature analysis on the currently traversed second URL link, and extract the protocol type part, the path part and the query part of the current second URL link.

Specifically, in practical applications a URL link uniquely identifies a resource on the network. In general, a URL link includes the following five components: a protocol type part (usually denoted Protocol), a server address part (usually denoted Host), a port number part (usually denoted Port), a path part (usually denoted Path) and a query part (usually denoted Fragment).

Of these, the protocol type part, the path part and the query part can usually embody the features of a URL link.

Therefore, this embodiment traverses the URL queue to be crawled and performs feature analysis on the currently traversed second URL link, thereby extracting its protocol type part (denoted p₁ below for ease of description), its path part (denoted p₂ below) and its query part (denoted p₃ below).

(2) Obtain the global feature URL link corresponding to the current second URL link according to the protocol type part, the path part and the query part.

Specifically, since the three parts p₁, p₂ and p₃ together embody all the features of the current second URL link, combining p₁, p₂ and p₃ yields the global feature URL link corresponding to the current second URL link. Below, p₁p₂p₃ denotes the global feature URL link corresponding to each second URL link.
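Steps (1) and (2) can be sketched in Python with `urllib.parse.urlsplit`. This is a sketch under stated assumptions: the query part p₃ is mapped to the `query` component returned by `urlsplit`, and the three parts are joined with a `|` separator, since the text does not specify how p₁, p₂ and p₃ are combined. The function names are hypothetical.

```python
from urllib.parse import urlsplit


def extract_features(url):
    """Return (p1, p2, p3): protocol type part, path part and query part."""
    parts = urlsplit(url)
    return parts.scheme, parts.path, parts.query


def global_feature_link(url, sep="|"):
    """Combine p1, p2 and p3 into one string (the separator is an assumption)."""
    return sep.join(extract_features(url))


# Hypothetical second URL link.
url = "http://example.com:8080/tea/green?page=2"
p1, p2, p3 = extract_features(url)
feature = global_feature_link(url)
```

Because the server address and port are not among the three parts, two links that differ only in host or port yield the same global feature link in this sketch.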

(3) Establish the correspondence between the current second URL link and the global feature URL link, and update the correspondence into the URL queue to be crawled.

Specifically, the correspondence between the current second URL link and the global feature URL link is established and updated into the URL queue to be crawled so that, during subsequent deduplication of the second URL links, the global feature URL link corresponding to the current second URL link can be found quickly from the correspondence, and the URL link fragments corresponding to the current second URL link can then be obtained from the global feature URL link.

In addition, in practical applications, the correspondence need not be updated into the URL queue to be crawled and may instead be stored separately. In that case, when joint deduplication is performed on the second URL links in the URL queue to be crawled, the global feature URL link corresponding to the currently traversed second URL link is looked up in the separately stored mapping table.
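The separately stored mapping table can be illustrated with an ordinary Python dictionary keyed by the second URL link. The sketch below is illustrative only; `build_mapping_table`, the `|` separator and the sample links are assumptions rather than details from the embodiment.

```python
from urllib.parse import urlsplit


def global_feature_link(url, sep="|"):
    """Join protocol, path and query into one string (separator assumed)."""
    parts = urlsplit(url)
    return sep.join((parts.scheme, parts.path, parts.query))


def build_mapping_table(to_crawl):
    """Map every second URL link to its global feature URL link."""
    return {url: global_feature_link(url) for url in to_crawl}


# Hypothetical queue contents.
to_crawl = ["http://example.com/tea?page=1", "http://example.com/fruit"]
mapping_table = build_mapping_table(to_crawl)
```

During deduplication, the global feature URL link of the currently traversed link is then a single dictionary lookup, for example `mapping_table[to_crawl[0]]`.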

Further, after the above correspondence and the global feature URL link corresponding to each second URL link have been obtained, the operation of performing joint deduplication on the second URL links in the URL queue to be crawled, using the counting Bloom filter based on link features in combination with multiple hashing, may specifically be as follows:

(1) Traverse the URL queue to be crawled and obtain the global feature URL link corresponding to the currently traversed second URL link.

Specifically, the global feature URL link corresponding to the currently traversed second URL link is obtained from the correspondence described above.

(2) Perform whole-link duplicate checking on the global feature URL link using the counting Bloom filter based on link features, to obtain the duplicate-check identifier corresponding to the global feature URL link.

Specifically, the counting Bloom filter used in this embodiment is not the counting Bloom filter conventionally used for link deduplication, but a counting Bloom filter based on the link features of URL links.

That is, when deduplicating links, the counting Bloom filter of this embodiment performs feature identification on the global feature URL link corresponding to each second URL link in the URL queue to be crawled and then performs whole-link duplicate checking according to the identified features. In other words, the features of each second URL link are compared with those of the links already recorded, thereby achieving whole-link duplicate checking.

In addition, so that the URL link fragments subsequently recombined from the feature fragments can be identified, a corresponding duplicate-check identifier may also be assigned to the global feature URL link.
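For orientation, the following Python sketch shows a conventional counting Bloom filter, that is, the baseline structure and not the link-feature variant of this embodiment. It illustrates how per-position counters allow an element to be removed, which a plain Bloom filter cannot do. The class name, the default sizes and the use of salted MD5 digests as hash functions are assumptions made for the example.

```python
import hashlib


class CountingBloomFilter:
    """Conventional counting Bloom filter with integer counters per position."""

    def __init__(self, size=1024, num_hashes=3):
        self.size = size
        self.num_hashes = num_hashes
        self.counters = [0] * size

    def _positions(self, item):
        # Derive num_hashes positions by salting the item with the hash index.
        for i in range(self.num_hashes):
            digest = hashlib.md5(f"{i}:{item}".encode("utf-8")).digest()
            yield int.from_bytes(digest[:8], "big") % self.size

    def add(self, item):
        for pos in self._positions(item):
            self.counters[pos] += 1

    def __contains__(self, item):
        return all(self.counters[pos] > 0 for pos in self._positions(item))

    def remove(self, item):
        """Decrement the counters of an item that appears to be present."""
        if item not in self:
            return False
        for pos in self._positions(item):
            self.counters[pos] -= 1
        return True


seen = CountingBloomFilter()
seen.add("http|/tea/green|page=2")
```

A membership test such as `"http|/tea/green|page=2" in seen` can still report a false positive for an element that was never added; the counters only add the ability to delete.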

(3) Perform feature identification on the global feature URL link according to the duplicate-check identifier, to obtain a plurality of feature fragments.

Specifically, again taking the global feature URL link p₁p₂p₃ as an example, the feature fragments obtained by performing feature identification on the global feature URL link may be the fragments containing the protocol type part, the path part and the query part respectively, namely feature fragment p₁, feature fragment p₂ and feature fragment p₃.

(4) Recombine the plurality of feature fragments according to the preset URL link recombination rule, to obtain N recombined URL link fragments.

It should be understood that, since a global feature URL link consists of three parts (the protocol type part, the path part and the query part), at least one recombined URL link fragment can be obtained; N in this embodiment is therefore an integer greater than or equal to 1.

In addition, in practical applications, the URL link recombination rule can be set by those skilled in the art as needed. For example, the rule may specify that a recombined URL link fragment must include feature fragment p₁, or that a recombined URL link fragment must not include feature fragment p₃. The possibilities are not enumerated here and are not limited in any way.

Correspondingly, if the URL link recombination rule is that a recombined URL link fragment must include feature fragment p₁, the resulting recombined URL link fragments generally comprise the fragment containing only p₁, the fragment containing only p₁ and p₂, and the fragment containing only p₁ and p₃.

If the URL link recombination rule is that a recombined URL link fragment must not include feature fragment p₃, the resulting recombined URL link fragments generally comprise the fragment containing only p₁ and the fragment containing only p₁ and p₂.

It should be understood that the above is only an example and does not limit the technical solution of the present invention in any way. In practical applications, those skilled in the art can configure this according to actual needs, which is not limited here.
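One way to picture the recombination is to enumerate combinations of the feature fragments and keep those accepted by a rule. The Python sketch below does this under two assumptions that are not stated in the text: a rule is modelled as a predicate function, and the full combination p₁p₂p₃ is skipped because the whole-link check already covers it. All names are hypothetical.

```python
from itertools import combinations


def recombine(fragment_names, rule):
    """Return every proper, order-preserving combination accepted by the rule."""
    kept = []
    # Sizes 1 .. len-1: the full combination is skipped (assumption).
    for size in range(1, len(fragment_names)):
        for combo in combinations(fragment_names, size):
            if rule(combo):
                kept.append(combo)
    return kept


def must_include_p1(combo):
    """Rule: a recombined fragment must contain p1."""
    return "p1" in combo


def include_p1_exclude_p3(combo):
    """Rule: a recombined fragment must contain p1 and must not contain p3."""
    return "p1" in combo and "p3" not in combo


names = ["p1", "p2", "p3"]
with_p1 = recombine(names, must_include_p1)
with_p1_without_p3 = recombine(names, include_p1_exclude_p3)
```

Under the first rule the sketch yields the fragments p₁, p₁p₂ and p₁p₃, so N is 3; applying both example rules together leaves p₁ and p₁p₂.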

(5) multiple Hash duplicate checking is carried out to N number of recombination URL link segment, it is corresponding obtains current second URL link Duplicate checking result.

It is worth noting that, in practical applications, a large number of second URL links may be cached in the URL queue to be crawled, so the number of URL link fragments obtained after recombination can be larger still. Therefore, in this embodiment, to reduce as far as possible the storage space occupied by the second URL links cached in the URL queue to be crawled, after the multiple feature fragments are recombined according to the preset URL link recombination rule to obtain N recombined URL link fragments, each of the N fragments can first be compressed using the MD5 algorithm to obtain its corresponding ciphertext string. The ciphertext string then replaces the content of the corresponding recombined URL link fragment.

It should be understood that the above is only one specific compression method and does not limit the technical solution of the present invention. In practical applications, those skilled in the art can choose a suitable compression method according to actual needs, and no restriction is imposed here.
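A minimal sketch of the MD5 step, assuming each recombined URL link fragment is a text string and that the "ciphertext string" is the 32-character hexadecimal MD5 digest. The helper names `md5_ciphertext` and `compress_fragments` are hypothetical. Note that an MD5 digest is a fixed-length, one-way value, so the original fragment cannot be recovered from it.

```python
import hashlib

def md5_ciphertext(fragment):
    """Return the 32-character hexadecimal MD5 digest of one fragment."""
    return hashlib.md5(fragment.encode("utf-8")).hexdigest()

def compress_fragments(fragments):
    """Replace each recombined fragment with its MD5 ciphertext string."""
    return [md5_ciphertext(fragment) for fragment in fragments]

# Hypothetical fragments, used only to show the fixed output length.
for text in compress_fragments(["https", "https/list", "https/list?page=2"]):
    print(len(text), text)
```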

Correspondingly, the above operation of performing a multiple-hash duplicate check on the N recombined URL link fragments to obtain the duplicate-check result corresponding to the current second URL link is specifically as follows:

(5-1) Extract the ciphertext strings corresponding to the N recombined URL link fragments, select any one of the N ciphertext strings, and hash it K times to obtain K hash values.

It should be understood that, because the web-crawler-based link deduplication scheme provided in this embodiment combines multiple hashing when performing joint deduplication on links, that is, each ciphertext string must be hashed at least twice, K is an integer greater than or equal to 2.

(5-2) Hash the K hash values into a pre-constructed bit-vector space as reference hash values, and set an initial count value on the variable-space counter corresponding to each reference hash value.

Specifically, in this embodiment the initial count value shown on the variable-space counter corresponding to each reference hash value is represented by "0".

(5-3) Hash each of the remaining N-1 ciphertext strings K times to obtain the K hash values corresponding to each remaining ciphertext string.

(5-4) Randomly hash the K hash values corresponding to each remaining ciphertext string into the bit-vector space, so that each is adjacent to some reference hash value.

Specifically, to make it easy to determine which reference hash value a newly hashed value in the bit-vector space is adjacent to, a criterion can be preset. For example, when a new hash value is inserted between two neighboring reference hash values, the reference hash value nearest to the newly inserted hash value can be taken as the adjacent reference hash value.

(5-5) Using head insertion, insert one preset character in front of the initial count value of the adjacent reference hash value for each hash value newly hashed into the bit-vector space.

Specifically, in this embodiment the preset character is "1".

For example, for one reference hash value, the initial count value shown on its variable-space counter is "0". When a new hash value is hashed to a position adjacent to it, one preset character "1" is inserted in front of the "0" using head insertion, and the count value shown on the variable-space counter becomes "10".

Correspondingly, if two new hash values are hashed to positions adjacent to this reference hash value, two preset characters "1" are inserted in front of the "0" using head insertion, and the count value shown on the variable-space counter becomes "110".

(5-6) Count the number of preset characters in front of the initial count value of each reference hash value, and determine the duplicate-check result corresponding to the current second URL link according to the number of preset characters.

Specifically, the duplicate-check result can be determined as follows:

If the number of preset characters "1" in front of the initial count value "0" is greater than 1, the recombined URL fragment is determined to be a duplicate and needs to be discarded;

Otherwise, the recombined URL fragment is determined not to be a duplicate and can be retained.
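Steps (5-1) to (5-6) can be sketched roughly as below. This is a hedged illustration rather than the patented implementation: the use of salted SHA-256 to derive K positions (deterministic here, whereas the text speaks of random hashing), the bit-vector size `m`, the tie-breaking rule in `nearest_reference`, and all function names are assumptions. How a counter is mapped back to a particular recombined fragment is not specified in the text, so the sketch stops at the per-counter decision.

```python
import hashlib

def k_hashes(ciphertext, k, m):
    """Derive k positions in a bit-vector space of size m (salted SHA-256 is an assumption)."""
    positions = []
    for i in range(k):
        digest = hashlib.sha256(f"{i}:{ciphertext}".encode("utf-8")).hexdigest()
        positions.append(int(digest, 16) % m)
    return positions

def nearest_reference(position, references):
    """Pick the reference position closest to position; ties go to the smaller one (assumption)."""
    return min(references, key=lambda ref: (abs(ref - position), ref))

def head_insert(counter, char="1"):
    """Insert one preset character in front of the counter string."""
    return char + counter

def count_marks(counter):
    """Count the preset characters in front of the initial count value "0"."""
    return counter.index("0")

def run_counters(ciphertexts, k=2, m=1024):
    first, rest = ciphertexts[0], ciphertexts[1:]
    # (5-1), (5-2): the first ciphertext supplies the reference positions, each counter starts at "0"
    counters = {pos: "0" for pos in k_hashes(first, k, m)}
    for text in rest:                      # (5-3)
        for pos in k_hashes(text, k, m):   # (5-4)
            ref = nearest_reference(pos, counters)
            counters[ref] = head_insert(counters[ref])  # (5-5)
    return counters

def is_duplicate(counter):
    """(5-6): more than one preset character before "0" means duplicate."""
    return count_marks(counter) > 1

print(head_insert(head_insert("0")))
```

The counter is kept as a string so that head insertion behaves as in the worked example, where "0" becomes "10" and then "110".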

(6) Retain or discard the second URL link in the URL queue to be crawled according to the duplicate-check result.

It should be understood that the above is only one specific implementation of joint deduplication and does not limit the technical solution of the present invention. In practical applications, those skilled in the art can make reasonable adjustments as needed, and no restriction is imposed here.

In addition, in practical applications, to further reduce storage space occupancy, after joint deduplication has been performed on the second URL links in the URL queue to be crawled using the chain-feature counting Bloom filter in combination with multiple hashing, each second URL link in the deduplicated URL queue to be crawled can also be compressed using the MD5 algorithm to obtain its corresponding ciphertext string. The ciphertext string then replaces the content of the corresponding second URL link, so that the second URL links in the URL queue to be crawled are compressed as far as possible and occupy less storage space.

As the foregoing description shows, the web-crawler-based link deduplication method provided in this embodiment uses the chain-feature counting Bloom filter in combination with multiple hashing to perform joint whole-and-part deduplication on the second URL links cached in the URL queue to be crawled. This reduces the false-positive rate of the counting Bloom filter as far as possible and effectively improves the performance of the web crawler, so that the crawler can quickly obtain the information people need and the user experience is improved as far as possible.

In addition, during deduplication, URL links are compressed using a compression algorithm such as the MD5 algorithm, which reduces storage space occupancy as far as possible.

Referring to Fig. 3, Fig. 3 is a schematic flowchart of a second embodiment of the web-crawler-based link deduplication method of the present invention.

Based on the first embodiment above, the web-crawler-based link deduplication method of this embodiment further includes, after step S50:

Step S60: judge whether an already-accessed second URL link exists in the deduplicated URL queue to be crawled.

Specifically, if it is determined that an already-accessed second URL link exists in the deduplicated URL queue to be crawled, that is, the web crawler has already accessed the page corresponding to that second URL link and captured the data information in the page, step S70 needs to be performed so that the web crawler does not access that second URL link again, capture the same data repeatedly, and waste crawler resources; otherwise, step S60 continues to be performed.

Step S70: delete the already-accessed second URL link from the URL queue to be crawled.

Specifically, in practical applications, a delete operation can be performed each time a second URL link is detected to have been accessed. Alternatively, accessed second URL links can first be marked, and when the marked second URL links reach a predetermined number or a predetermined deletion time arrives, all currently marked second URL links are deleted together.
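The two deletion strategies described above (delete immediately, or mark and delete in a batch) can be sketched as follows. The class name `CrawlQueue`, the `batch_size` parameter, and the use of a `deque` are assumptions made for illustration; the time-triggered variant mentioned in the text is not shown.

```python
from collections import deque

class CrawlQueue:
    """URL queue to be crawled with immediate or batched deletion of accessed links."""

    def __init__(self, urls, batch_size=None):
        self.urls = deque(urls)
        self.batch_size = batch_size  # None means delete as soon as a link is accessed
        self.marked = set()

    def mark_visited(self, url):
        if self.batch_size is None:
            self._delete({url})
            return
        self.marked.add(url)
        if len(self.marked) >= self.batch_size:
            # The predetermined number has been reached: delete all marked links together.
            self._delete(self.marked)
            self.marked = set()

    def _delete(self, visited):
        self.urls = deque(u for u in self.urls if u not in visited)

# Hypothetical URLs for illustration only.
queue = CrawlQueue(["u1", "u2", "u3"], batch_size=2)
queue.mark_visited("u1")
queue.mark_visited("u3")
print(list(queue.urls))
```

Batching trades a slightly stale queue for fewer rewrites of the queue, which may matter when the queue is large.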

As the foregoing description shows, the web-crawler-based link deduplication method provided in this embodiment detects, periodically or in real time, the access status of the second URL links in the URL queue to be crawled, and deletes an accessed second URL link from the queue when one is detected. This ensures that the second URL links cached in the URL queue to be crawled have not yet been accessed, prevents the web crawler from crawling the same data repeatedly through the same second URL link, and further improves the performance of the web crawler.

In addition, an embodiment of the present invention further provides a computer-readable storage medium storing a web-crawler-based link deduplication program which, when executed by a processor, implements the steps of the web-crawler-based link deduplication method described above.

Referring to Fig. 4, Fig. 4 is a structural block diagram of a first embodiment of the web-crawler-based link deduplication device of the present invention.

As shown in Fig. 4, the web-crawler-based link deduplication device proposed by the embodiment of the present invention includes: an extraction module 4001, a sending module 4002, a grabbing module 4003, a parsing module 4004, and a deduplication module 4005.

The extraction module 4001 is configured to, upon receiving a data-capture request for agricultural products to be analyzed, extract the first uniform resource locator (URL) link of the platform to be visited from the data-capture request. The sending module 4002 is configured to send an access request to the platform to be visited according to the first URL link. The grabbing module 4003 is configured to, after receiving the response made by the platform to be visited according to the access request, capture the data information in the page corresponding to the first URL link. The parsing module 4004 is configured to parse the data information, obtain the second URL link embedded in the page, and add the second URL link to the URL queue to be crawled. The deduplication module 4005 is configured to perform joint deduplication on the second URL links in the URL queue to be crawled using the chain-feature counting Bloom filter in combination with multiple hashing.

It should be noted that each module involved in this embodiment is a logic module. In practical applications, a logic unit can be a physical unit, part of a physical unit, or a combination of multiple physical units. In addition, to highlight the innovative part of the present invention, units less closely related to solving the technical problem addressed by the invention are not introduced in this embodiment, but this does not mean that no other units exist in this embodiment.

In addition, it is worth noting that, in this embodiment, when the deduplication module 4005 performs joint deduplication on the second URL links in the URL queue to be crawled using the chain-feature counting Bloom filter in combination with multiple hashing, the operation divides specifically into deduplication of the overall-feature URL link corresponding to each URL link and deduplication of URL link fragments.

Because URL link fragments are obtained from the overall-feature URL link, the correspondence between each second URL link and its overall-feature URL link must be determined first so that the deduplication module 4005 can perform the above operations smoothly.

The correspondence between a second URL link and an overall-feature URL link can be determined roughly as follows:

First, traverse the URL queue to be crawled, perform feature analysis on the current second URL link being traversed, and extract its protocol type part, path part, and query part;

Then, obtain the overall-feature URL link corresponding to the current second URL link from the protocol type part, the path part, and the query part;

Finally, establish the correspondence between the current second URL link and the overall-feature URL link, and update the correspondence into the URL queue to be crawled.
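The three steps above can be sketched with Python's standard `urllib.parse` module. The representation of an overall-feature URL link as a `(protocol, path, query)` tuple and the function names are assumptions. The text does not say whether the host name belongs to the path part, so this sketch leaves it out.

```python
from urllib.parse import urlsplit

def overall_feature(url):
    """Extract the protocol type part, path part, and query part of one URL link."""
    parts = urlsplit(url)
    return (parts.scheme, parts.path, parts.query)

def build_correspondence(queue):
    """Map each second URL link in the queue to its overall-feature URL link."""
    return {url: overall_feature(url) for url in queue}

# example.com is a placeholder host used only for illustration.
queue = ["https://example.com/products/list?page=2&sort=price"]
print(build_correspondence(queue))
```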

Correspondingly, after the above correspondence is obtained, the deduplication module 4005 specifically performs the following operations:

First, traverse the URL queue to be crawled and obtain the overall-feature URL link corresponding to the current second URL link being traversed;

Then, perform a whole duplicate check on the overall-feature URL link using the chain-feature counting Bloom filter to obtain the duplicate-check mark corresponding to the overall-feature URL link;

Then, perform feature identification on the overall-feature URL link according to the duplicate-check mark to obtain multiple feature fragments;

Then, recombine the multiple feature fragments according to the preset URL link recombination rule to obtain N recombined URL link fragments;

Then, perform a multiple-hash duplicate check on the N recombined URL link fragments to obtain the duplicate-check result corresponding to the current second URL link;

Finally, retain or discard the second URL link in the URL queue to be crawled according to the duplicate-check result.

It should be noted that, in this embodiment, N is an integer greater than or equal to 1.

It should be understood, however, that the above is only one specific way of determining the correspondence between a second URL link and an overall-feature URL link and of performing joint deduplication on the second URL links in the URL queue to be crawled using the chain-feature counting Bloom filter in combination with multiple hashing. It does not limit the technical solution of the present invention. In specific applications, those skilled in the art can configure this as needed, and the present invention imposes no restriction.
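For orientation, the whole duplicate check can be pictured with a textbook counting Bloom filter, sketched below. This is the standard data structure, not the chain-feature variant of this document, whose internal layout is not described in this passage. The class name, the sizes, and the use of salted SHA-256 for the hash positions are assumptions.

```python
import hashlib

class CountingBloomFilter:
    """Textbook counting Bloom filter: an array of small counters instead of bits."""

    def __init__(self, size=1024, num_hashes=3):
        self.size = size
        self.num_hashes = num_hashes
        self.counts = [0] * size

    def _positions(self, item):
        for i in range(self.num_hashes):
            digest = hashlib.sha256(f"{i}:{item}".encode("utf-8")).hexdigest()
            yield int(digest, 16) % self.size

    def __contains__(self, item):
        # May report a false positive, never a false negative.
        return all(self.counts[pos] > 0 for pos in self._positions(item))

    def add(self, item):
        for pos in self._positions(item):
            self.counts[pos] += 1

    def remove(self, item):
        # Counters make deletion possible, unlike a plain Bloom filter.
        if item in self:
            for pos in self._positions(item):
                self.counts[pos] -= 1

def check_and_add(bloom, item):
    """Return True if item was probably seen before; otherwise record it and return False."""
    seen = item in bloom
    if not seen:
        bloom.add(item)
    return seen

bloom = CountingBloomFilter()
print(check_and_add(bloom, "https|/list|page=2"), check_and_add(bloom, "https|/list|page=2"))
```

The returned flag plays the role of the duplicate-check mark in the steps above; the fragment-level multiple-hash check described in this document is what is intended to lower the false-positive rate of this first pass.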

Further, in practical applications, to reduce as far as possible the storage space occupied by the second URL links cached in the URL queue to be crawled, after the multiple feature fragments are recombined according to the preset URL link recombination rule to obtain N recombined URL link fragments, each of the N fragments can first be compressed using the MD5 algorithm to obtain its corresponding ciphertext string. The ciphertext string then replaces the content of the corresponding recombined URL link fragment.

Correspondingly, the operation of performing a multiple-hash duplicate check on the N recombined URL link fragments to obtain the duplicate-check result corresponding to the current second URL link is specifically as follows:

First, extract the ciphertext strings corresponding to the N recombined URL link fragments, select any one of the N ciphertext strings, and hash it K times to obtain K hash values;

Then, hash the K hash values into a pre-constructed bit-vector space as reference hash values, and set an initial count value on the variable-space counter corresponding to each reference hash value;

Then, hash each of the remaining N-1 ciphertext strings K times to obtain the K hash values corresponding to each remaining ciphertext string;

Then, randomly hash the K hash values corresponding to each remaining ciphertext string into the bit-vector space, so that each is adjacent to some reference hash value;

Then, using head insertion, insert one preset character in front of the initial count value of the adjacent reference hash value for each hash value newly hashed into the bit-vector space;

Finally, count the number of preset characters in front of the initial count value of each reference hash value, and determine the duplicate-check result corresponding to the current second URL link according to the number of preset characters.

It should be noted that, in this embodiment, K is an integer greater than or equal to 2.

It should be understood, however, that the above is only one specific way of obtaining the duplicate-check result corresponding to the current second URL link and does not limit the technical solution of the present invention. In specific applications, those skilled in the art can configure this as needed, and the present invention imposes no restriction.

In addition, in practical applications, to further reduce storage space occupancy, after joint deduplication has been performed on the second URL links in the URL queue to be crawled, each second URL link in the deduplicated URL queue to be crawled can also be compressed using the MD5 algorithm to obtain its corresponding ciphertext string. The ciphertext string then replaces the content of the corresponding second URL link, so that the second URL links in the URL queue to be crawled are compressed as far as possible and occupy less storage space.

As the foregoing description shows, the web-crawler-based link deduplication device provided in this embodiment uses the chain-feature counting Bloom filter in combination with multiple hashing to perform joint whole-and-part deduplication on the second URL links cached in the URL queue to be crawled. This reduces the false-positive rate of the counting Bloom filter as far as possible and effectively improves the performance of the web crawler, so that the crawler can quickly obtain the information people need and the user experience is improved as far as possible.

It should be noted that the workflow described above is only illustrative and does not limit the scope of protection of the present invention. In practical applications, those skilled in the art can select some or all of it according to actual needs to achieve the purpose of the solution of this embodiment, and no restriction is imposed here.

In addition, for technical details not described in detail in this embodiment, reference can be made to the web-crawler-based link deduplication method provided by any embodiment of the present invention; they are not repeated here.

Based on the first embodiment of the web-crawler-based link deduplication device above, a second embodiment of the web-crawler-based link deduplication device of the present invention is proposed.

In this embodiment, the web-crawler-based link deduplication device further includes a deletion module.

Specifically, the deletion module is configured to judge whether an already-accessed second URL link exists in the deduplicated URL queue to be crawled.

Correspondingly, if an already-accessed second URL link exists in the URL queue to be crawled, the accessed second URL link is deleted from the URL queue to be crawled; otherwise, the module continues to monitor the second URL links in the URL queue to be crawled and to judge whether an already-accessed second URL link exists.

It should be understood, however, that the above is merely an example and does not limit the technical solution of the present invention. In specific applications, those skilled in the art can configure this as needed, and the present invention imposes no restriction.

As the foregoing description shows, the web-crawler-based link deduplication device provided in this embodiment detects, periodically or in real time, the access status of the second URL links in the URL queue to be crawled, and deletes an accessed second URL link from the queue when one is detected. This ensures that the second URL links cached in the URL queue to be crawled have not yet been accessed, prevents the web crawler from crawling the same data repeatedly through the same second URL link, and further improves the performance of the web crawler.

In addition, it should be noted that, herein, the terms "include", "comprise", or any other variant thereof are intended to cover a non-exclusive inclusion, so that a process, method, article, or system that includes a series of elements includes not only those elements but also other elements not explicitly listed, or elements inherent to such a process, method, article, or system. Without further limitation, an element qualified by the phrase "including a ..." does not exclude the existence of additional identical elements in the process, method, article, or system that includes that element.

The serial numbers of the above embodiments of the present invention are for description only and do not represent the relative merits of the embodiments.

Through the above description of the embodiments, those skilled in the art can clearly understand that the methods of the above embodiments can be implemented by software plus a necessary general-purpose hardware platform, or of course by hardware, but in many cases the former is the better implementation. Based on this understanding, the technical solution of the present invention, in essence or in the part that contributes to the prior art, can be embodied in the form of a software product. The computer software product is stored in a storage medium (such as read-only memory (ROM)/RAM, a magnetic disk, or an optical disc) and includes several instructions that cause a terminal device (which can be a mobile phone, a computer, a server, a network device, or the like) to execute the methods described in the embodiments of the present invention.

The above are only preferred embodiments of the present invention and do not limit its patent scope. Any equivalent structure or equivalent process transformation made using the contents of the specification and drawings of the present invention, or any direct or indirect application in other related technical fields, is likewise included within the scope of patent protection of the present invention.

Claims

1. A link deduplication method based on a web crawler, characterized in that the method comprises the following steps:

upon receiving a data-capture request for agricultural products to be analyzed, extracting the first uniform resource locator (URL) link of the platform to be visited from the data-capture request;

after receiving the response made by the platform to be visited according to the access request, capturing the data information in the page corresponding to the first URL link;

parsing the data information, obtaining the second URL link embedded in the page, and adding the second URL link to the URL queue to be crawled;

performing joint deduplication on the second URL links in the URL queue to be crawled using the chain-feature counting Bloom filter in combination with multiple hashing.

2. The method according to claim 1, characterized in that, before the step of performing joint deduplication on the second URL links in the URL queue to be crawled using the chain-feature counting Bloom filter in combination with multiple hashing, the method further comprises:

traversing the URL queue to be crawled, performing feature analysis on the current second URL link being traversed, and extracting the protocol type part, path part, and query part of the current second URL link;

obtaining the overall-feature URL link corresponding to the current second URL link from the protocol type part, the path part, and the query part;

establishing the correspondence between the current second URL link and the overall-feature URL link, and updating the correspondence into the URL queue to be crawled.

3. The method according to claim 2, characterized in that the step of performing joint deduplication on the second URL links in the URL queue to be crawled using the chain-feature counting Bloom filter in combination with multiple hashing comprises:

traversing the URL queue to be crawled and obtaining the overall-feature URL link corresponding to the current second URL link being traversed;

performing a whole duplicate check on the overall-feature URL link using the chain-feature counting Bloom filter to obtain the duplicate-check mark corresponding to the overall-feature URL link;

recombining the multiple feature fragments according to a preset URL link recombination rule to obtain N recombined URL link fragments, N being an integer greater than or equal to 1;

performing a multiple-hash duplicate check on the N recombined URL link fragments to obtain the duplicate-check result corresponding to the current second URL link;

retaining or discarding the second URL link in the URL queue to be crawled according to the duplicate-check result.

4. The method according to claim 3, characterized in that, after the step of recombining the multiple feature fragments according to the preset URL link recombination rule to obtain N recombined URL link fragments, the method further comprises:

compressing each of the N recombined URL link fragments using the MD5 algorithm to obtain the ciphertext strings corresponding to the N recombined URL link fragments;

5. The method according to claim 4, characterized in that the step of performing a multiple-hash duplicate check on the N recombined URL link fragments to obtain the duplicate-check result corresponding to the current second URL link comprises:

extracting the ciphertext strings corresponding to the N recombined URL link fragments, selecting any one of the N ciphertext strings, and hashing it K times to obtain K hash values, K being an integer greater than or equal to 2;

hashing the K hash values into a pre-constructed bit-vector space as reference hash values, and setting an initial count value on the variable-space counter corresponding to each reference hash value;

hashing each of the remaining N-1 ciphertext strings K times to obtain the K hash values corresponding to each remaining ciphertext string;

randomly hashing the K hash values corresponding to each remaining ciphertext string into the bit-vector space, so that each is adjacent to some reference hash value;

using head insertion, inserting one preset character in front of the initial count value of the adjacent reference hash value for each hash value newly hashed into the bit-vector space;

counting the number of preset characters in front of the initial count value of each reference hash value, and determining the duplicate-check result corresponding to the current second URL link according to the number of preset characters.

6. The method according to any one of claims 1 to 5, characterized in that, after the step of performing joint deduplication on the second URL links in the URL queue to be crawled using the chain-feature counting Bloom filter in combination with multiple hashing, the method further comprises:

compressing each second URL link in the deduplicated URL queue to be crawled using the MD5 algorithm to obtain the ciphertext string corresponding to each second URL link;

7. The method according to any one of claims 1 to 5, characterized in that, after the step of performing joint deduplication on the second URL links in the URL queue to be crawled using the chain-feature counting Bloom filter in combination with multiple hashing, the method further comprises:

if an already-accessed second URL link exists in the URL queue to be crawled, deleting the accessed second URL link from the URL queue to be crawled.

8. A link deduplication device based on a web crawler, characterized in that the device comprises:

an extraction module, configured to, upon receiving a data-capture request for agricultural products to be analyzed, extract the first uniform resource locator (URL) link of the platform to be visited from the data-capture request;

a grabbing module, configured to, after receiving the response made by the platform to be visited according to the access request, capture the data information in the page corresponding to the first URL link;

a parsing module, configured to parse the data information, obtain the second URL link embedded in the page, and add the second URL link to the URL queue to be crawled;

a deduplication module, configured to perform joint deduplication on the second URL links in the URL queue to be crawled using the chain-feature counting Bloom filter in combination with multiple hashing.

9. Link deduplication equipment based on a web crawler, characterized in that the equipment comprises: a memory, a processor, and a web-crawler-based link deduplication program stored in the memory and executable on the processor, the web-crawler-based link deduplication program being configured to implement the steps of the web-crawler-based link deduplication method according to any one of claims 1 to 7.

10. A computer-readable storage medium, characterized in that a web-crawler-based link deduplication program is stored on the computer-readable storage medium, and the web-crawler-based link deduplication program, when executed by a processor, implements the steps of the web-crawler-based link deduplication method according to any one of claims 1 to 7.